PREFACE


In 1938 the writer was requested by the Committee on Measurement
and Guidance of the American Council on Education to undertake
an exploratory survey of interest expressed, at various schools
and colleges, in the general topic of
educational aptitude tests, their nature and their practical pos- 
sibilities for counseling purposes. During the course of this 
inquiry, twenty-one colleges or universities and a larger number 
of secondary schools, both public and private, were consulted. 
The localities visited were distributed among sixteen states in
the Northeastern, North Central and Southeastern areas. In 
addition, considerable correspondence was exchanged with edu- 
cators at other institutions. 

A detailed report of this survey, submitted to the American 
Council in September, 1939, indicated a widespread and—at 
least within the areas represented—a well-nigh universal desire 
for improved means of forecasting individuals’ relative promise 
for differential fields of study. Particularly cordial was the re- 
sponse to a tentative proposal for centrally organized coopera- 
tive research along these lines, in the hope that pooled efforts 
and the analysis of results obtained from a number of localities 
would serve the double purpose: (a) of providing more extensive
and thorough try-out of various existent testing instruments
of the “differential aptitude” types than has heretofore
been possible; and (b) of adapting these, or producing new 
devices, to meet recognized guidance needs in certain directions 
as yet largely unexplored. Much of the factual material col- 
lected in this survey dealt rather fully with existent objective- 
testing programs or other means of educational measurement 
employed at the time by the various institutions visited. Since 
many new developments have later occurred, especially under 
the impact of war, these details now seem rather extraneous
to the ensuing discussion.

As a result of the interest in such possibilities, revealed by 
that preliminary survey, I was subsequently requested to pre- 
pare a “Manual” on the subject of educational aptitude tests 
for use by student counselors, particularly at senior high school 
and college freshman levels. That seemed, at the moment, a
reasonable and not excessively difficult assignment; it turned out to
be far more than anyone, including myself, had anticipated. In 
fact, I should scarcely have ventured upon this project but for
a naive conceit, then, of general familiarity with the topic. The 
more I have studied its many ramifications since, the more chas- 
tened, quite frankly, I have become. 

In 1942 a preliminary draft of the manual was distributed 
for criticism to some fifty educational and psychological testing 
authorities. That draft was mimeographed at Washington by 
the American Council on Education under quite difficult cir- 
cumstances and unavoidable delay, due to shortage in clerical 
staff. Moreover, there was no opportunity to proofread the 
mimeoscript, so that numerous errors in typing and pagination 
were uncorrected. 

Nevertheless, comments thereon were generally favorable 
and, from certain critics, especially helpful. Their interest and 
encouragement are gratefully acknowledged. These led to com- 
plete revision of the originally projected, fairly simple manual 
toward a volume of considerably enlarged scope, embodying 
additional detail and technical references. Further complica- 
tions resulting from war-torn years have much impeded this 
process; but new data collected throughout such unexpectedly 
protracted time for revision will, it is hoped, increase the value 
of our commentary.

Meanwhile it has become evident that current developments 
in aptitude and other testing procedures are going forward at 
so rapid a pace that any book attempting to keep up with them 
at the latest moment would never quite reach the printer. In- 
quiries resulting from private circulation of the initial draft, 
as to when its revision would become more widely available, have 
revealed so much interest in this topic of educational aptitude 
tests that each successive major part will be published upon 
completion. This should, in turn, provide opportunity for sub- 
sequent revisions at some later date, in the light of still further 
criticism. 

Not only was there much to learn in this expanded study 
about experiments elsewhere with materials analogous to those 
employed at Yale; there was even more to be learned about 
other fields not originally contemplated for survey. Various sug- 
gestions were advanced by critics of the preliminary draft for 
expansion of the earlier plan to include review of aptitude test- 
ing in aesthetics, professional and vocational areas. Divisions 
and subdivisions of the relatively simple structure first con- 
ceived have frequently been made, as one chapter after another 
“got out of bounds.” A discussion of achievement tests and
their functional relationship to those of the aptitude type
seemed advisable. The effect of interest and personality factors 
upon student performance—and therefore, retroactively, upon 
the means of appraising or forecasting such performance— 
could not be overlooked; and so the matter grew, by a sort of 
“binary fission.” Unlike the amoeba, however, these successive 
divisions are both complex and differential. 

Evolution of this book has proceeded not so much by intent 
as because the subject itself—educational aptitude testing, at 
the levels noted above—has been, still is and doubtless will long 
continue to be in a stage of expansion and further subdivision. 
The writer must therefore admit that, in “overdoing the job,” 
he has by no means fulfilled the initial requirements. Neither 
Part I, as now published, nor the subsequent Parts to come are 
simple in nature. It has proved necessary to consider in detail, 
with frequent reference to professional literature, many points 
raised in the earlier draft and others which have emerged 
through later developments in this field. In other words, this 
more complicated task must be undertaken and put to the test 
of further criticism before the simpler manual originally 
planned can rest upon a sound foundation. 

Indebtedness to many collaborators cannot adequately be 
chronicled. Appreciation is, of course, due first to the American 
Council on Education, which sponsored the initial survey and 
later rendered valuable assistance in the preparation of this 
book—particularly, as noted above, through having the original 
typescript duplicated for criticism. Gratitude is acknowledged 
for the moral and material support, alike, provided through 
the Council by its President, Dr. George F. Zook, his Adminis- 
trative Associate, Donald J. Shank, and above all by the late 
Dean of Columbia College, Herbert C. Hawkes, for many years 
Chairman of its Committee on Measurement and Guidance. To 
his warm and always friendly encouragement on this and other 
projects of mutual interest, I am lastingly indebted. 

Among the various authorities who took the trouble to read, 
and in detail constructively to criticize, the preliminary draft, 
special appreciation is extended to: Henry Chauncey, formerly
Assistant Dean of Harvard College and now Associate Secretary
of the College Entrance Examination Board; Irvin L. Child,
Department of Psychology, Yale University; President Allan R.
Cullimore of the Newark College of Engineering; Lt. Comdr.
Frank M. Fletcher, Jr., USNR; Capt. Albert P. Johnson of
Purdue University (on leave with the Army Air Corps);
G. Frederic Kuder, of the War Department, editor of Educational
and Psychological Measurement; I. L. Kandel, Professor of
Education, Teachers College, Columbia University; Professor
Andrew H. MacPhail, Department of Psychology, Brown
University; Dean R. L. Sackett, Committee on Student Selection
and Guidance of the Engineers Council for Professional
Development; Lt. Col. Kenneth E. Schnelle, AUS; Professor
Carl E. Seashore, Department of Psychology, State University
of Iowa; Professor John M. Stalnaker, lately with the College
Entrance Examination Board and now at Stanford University;
Professor L. L. Thurstone, University of Chicago; Professor
Herbert A. Toops, Department of Psychology, Ohio State
University; Arthur E. Traxler, Educational Records Bureau;
Professor Ralph Tyler, University of Chicago; J. Richard
Wittenborn, Clinical Psychologist, Department of University
Health, Yale University; Professor Ben D. Wood, Bureau of
Collegiate Educational Research, Columbia University.

Space does not permit full acknowledgment of their helpful- 
ness in several ways. It may, however, be noted that Child, Stal- 
naker and Wittenborn made particularly valuable suggestions 
regarding Chapter II, “The Measurement of Educational
Performance and Basic Statistical Principles,” and Chapter VI,
“Unitary Traits and Primary Abilities,” while Toops and Mac- 
Phail reviewed the preliminary mimeographed copy throughout 
with characteristic “line-by-line” thoroughness. Dean Sea- 
shore’s comments on numerous points were also helpful and 
illuminating. Such careful attention to our efforts by these and 
other leaders in the field of educational measurement, despite 
the inroads of war activities upon their time, is most gratifying. 

Development of the Yale Educational Aptitude Battery, and 
of certain other tests or investigations later discussed, was orig- 
inally made possible by a research grant from Nicholas Saltus 
Ludington (B.A. Yale, 1927). The Institute of Human Rela- 
tions at Yale University and the Carnegie Corporation made 
subsequent contributions to this same project. Also the Officers 
and Corporation of Yale University have for a considerable pe- 
riod supported it, either directly or indirectly, through allot- 
ments to the Department of Personnel Study and Student Ap- 
pointment Bureau. 

The senior author of this book particularly acknowledges 
deep indebtedness to his secretary and collaborator, Marion 
Treadway Inglis, who not only has patiently struggled with 
illegible emendations but in a positive sense has helpfully aided
its numerous stages of revision. The junior author, Dr. Paul 
S. Burnham, has long carried a major share of the basic apti- 
tude testing experiment at Yale. Dr. Burnham, then a Research 
Assistant, for several years was in direct charge of that experi- 
ment. He later served as Senior Examiner for the Connecticut 
State Personnel Department and had no direct part in writing 
the earlier draft of this book; hence frequent use of the terms 
“writer” and “author” (in the singular) will be found through- 
out all chapters in Part I and those to follow. Fortunately for 
me, he later returned to Yale and has been of great help in re- 
vising this earlier draft during the past two years. However, 
responsibility for errors of omission or commission therein and 
for what may be called (humbly and without pretense of knowl- 
edge) its general philosophy is mine alone. 

A book of this nature is inevitably an aggregate of personal 
opinions. I have honestly attempted to produce and evaluate 
fairly whatever evidence concerning educational aptitude and 
related differential measures could be found. Nevertheless, such 
evidence often is not clear-cut or decisive; its interpretation va- 
ries more or less according to the writer's outlook. Ideally, such 
a volume would be quite objective and free from personal bias, 
but only a great scholar of much wisdom can attain complete 
objectivity. 

If I have ventured to state certain opinions frankly, it was 
with the realization always, and the public admission now, of 
how fallible and mistaken these might be. If one chooses not to 
“pull his punches,” some will doubtless be wild, and he can only 
expect then to be hit back, hard. Again if, as a result, certain 
issues are brought out and clarified, the beating he may later 
take will be endurable. Moreover, while making no claim to com- 
plete freedom from bias, and even less to pontifical knowledge 
above the level of personal opinion, I have attempted scrupu- 
lously to avoid any distortion of evidence or “hitting below the 
belt.” 

As subsequently stated in Chapter V, “A Sample Aptitude
Battery,” disproportionate emphasis has been placed there and
subsequently throughout other chapters (especially in Part II,
not yet published) upon educational aptitude testing results 
at Yale University. Perhaps we are less severely critical of our 
own handiwork than of others’; parents are noticeably tolerant 
of shortcomings in their offspring! The chief reason why Yale 
tests are so frequently mentioned, however, is first, that Dr. 
Burnham and I have far more data available regarding them
than we have been able to collect about other materials; and
second, that the Yale Battery—unlike aptitude tests designed for
one or another particular field—has significance for many. Yet 
no one is more conscious than the authors of its technical and 
other faults. 

We both believe (and here, if you will, is bias again) that its 
general plan, or approach, has distinct merit, demonstrated re- 
peatedly in the practical realm of student guidance. We hope 
that its exposition may lead eventually to the development of 
much-improved measures, following the same general pattern 
but more extensively validated. The Yale experiment has been 
limited in scope by restricted population samples and inade- 
quate research funds. Following the survey mentioned above, 
for the American Council on Education, of interest among sec- 
ondary schools and colleges in educational aptitude testing, the 
writer proposed to the Council a large-scale project along these 
lines. Nothing came of that suggestion, partly because the in- 
vestigations discussed in Chapter VI on Unitary Traits and 
Primary Mental Abilities were already in progress and might 
be expected even better to serve the same purpose. 

Some years ago (Hawkes, 1931, p. 35), a grant of $500,000 
over a ten-year period from the General Education Board made 
possible establishment of the Cooperative Test Service (a major 
project of the American Council’s Committee on Measurement 
and Guidance). This provided a wide series of objective achieve- 
ment tests, not only of “general culture,” as appraised by a 
standard battery, but also of acquired learning in many sub- 
jects and at various educational levels. This has proved of tre- 
mendous value in educational selection, placement and counsel- 
ing. Is it too much to hope that some parallel battery, evalu- 
ating relative aptitude (readiness-to-learn) for the various 
major areas of collegiate work, may likewise be developed? It 
is our firm opinion, buttressed by a considerable weight of evi- 
dence in the present and succeeding Parts of this study, that 
further experimentation with some uniform set of educational 
aptitude measures, nationally administered but centrally con- 
trolled, is of major importance for the effective selection, meas- 
urement, encouragement and guidance of American youth. 

ALBERT BEECHER CRAWFORD
Yale University, New Haven 
January 1, 1946. 
CHAPTER I


BACKGROUND OF EDUCATIONAL 
MEASUREMENTS 


INTRODUCTION 


The nineteen-twenties are well remembered as crucial
years affecting our lives both singly and collectively in
far-reaching ways, then and thereafter. What may not
be so generally realized is that this period also marked a signifi- 
cant turning point in the science of mental measurement. Ear- 
lier progress had been slow, halting, characterized to some ex- 
tent by an exploration of the importance of new tests, but 
largely absorbed in the latter respect with “general intelli- 
gence.” Means for the effective analysis of data had long waited 
upon the discovery, and later upon the utilization, of statistical 
concepts or techniques which now seem elementary. For brief 
but adequate accounts of the testing movement up to that pe- 
riod, the reader is referred to two outstanding books in the field: 
one published by Truman Kelley, Interpretation of Educa- 
tional Measurements (1927), and the other by Clark Hull, 
Aptitude Testing (1928). Their accounts are of great value to 
any person desiring a review of the prior objective-testing
movement. 

Beginning in and continuing since the 1920 decade, individ- 
ual measurement and guidance methods have made great 
strides. Various developments in basic theory and technical 
equipment for the performance of complex statistical calcula- 
tions jointly facilitated the means of dealing with extensive, 
massed data which earlier methods could not adequately treat. 
At the same time, there occurred a veritable flood of mounting 
interest in “human measurement” problems. 

The American Council on Education established in 1936 one 
of its most fruitful Committees—that on Measurement and 
Guidance—whose title epitomizes the trend of subsequent ef- 
forts. This Committee has done a great deal to clarify educa- 
tional thinking and to provide the means for improved in- 
dividual counseling. Through its sympathetic interest, the 
Educational Records Bureau has become a service organization 
of widespread importance in the secondary school field. The 
Committee’s further, more direct, concern with such projects
(later specifically discussed) as the American Council Psycho- 
logical Examination, the Cooperative Test Service, aptitude 
measurements and recent extensive research on Primary Mental 
Abilities, represents further contributions of outstanding sig- 
nificance for higher education. 

During recent years various state-wide, objective testing 
programs for high schools and colleges (facilitated to no small 
degree by parallel efforts of the American Council's Committee 
on Measurement and Guidance) have also come into being. 
Among the first commonwealths to utilize these modern tech- 
niques as important factors within the total range of publicly 
supported education were some of the central states—e.g., 
Iowa, Minnesota, Ohio and Wisconsin. This desirable movement
has since progressed extensively; yet it seems unfortunate
that so many different tests and “every student” programs now 
exist as to preclude general comparability (from one region to 
another) of their respective group data or individual scores. 
A great service could be rendered by the American Council on
Education or some other authoritative body in facilitating
studies which would make possible the conversion of various
testing results to some common scale or basic frame of reference.
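
As an illustration of what conversion to a common scale can mean in its simplest form, the following is a minimal sketch and not a method proposed in the text. It rescales the raw scores of each test linearly to the same mean and standard deviation. The function name, the target values of 50 and 10, and the two sets of scores are all hypothetical. Such a rescaling makes scores comparable only if the groups tested are themselves comparable; relating results from different regions would require further evidence, such as examinees or test material common to both programs.

```python
from statistics import mean, pstdev


def to_standard_scores(raw_scores, target_mean=50.0, target_sd=10.0):
    """Linearly rescale one test's raw scores to a chosen mean and spread.

    Assumption: the target mean (50) and SD (10) are illustrative choices,
    not values taken from the text.
    """
    m = mean(raw_scores)
    sd = pstdev(raw_scores)  # population standard deviation of this group
    if sd == 0:
        raise ValueError("scores have no spread; cannot standardize")
    return [target_mean + target_sd * (x - m) / sd for x in raw_scores]


# Hypothetical raw scores from two different tests with different units.
test_a = [12, 15, 18, 21, 24]
test_b = [40, 50, 60, 70, 80]

scaled_a = to_standard_scores(test_a)
scaled_b = to_standard_scores(test_b)
```

Here the two hypothetical groups have the same relative standing within their own tests, so the rescaled values coincide even though the raw units differ.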


The Vocabulary of Mental Measurements


One result of enhanced emphasis upon objective (“restricted-answer”)
achievement tests was adoption by the educationally
élite of such statistical terms as “correlation,” “reliability,” 
“deviation,” “validity” and “individual differences.” These 
achieved rapid circulation among teachers untrammeled by 
personal vagueness as to just what they meant. Subsequently, 
newer and even more esoteric concepts—e.g., multiple correla- 
tion, item analysis, factor theories, variance and covariance, 
confidence level, etc.—though perhaps no less well compre- 
hended by many of their facile users, have gained rather wide- 
spread vogue. 

Meanwhile, mechanical aids to scoring and tabulation of test 
results have greatly broadened the practicable scope of
intellectual and vocational measurement possibilities. The
introduction of separate answer-sheets, which by either electrical
(machine) or manual (clerk) scoring greatly facilitate routine
operations and reduce both the cost and time they require,
together with numerous short-cuts developed in computational
methods and other statistical aids—especially the modern
punch-card systems for recording and quickly analyzing complex
data—has made it feasible for test administrators and research
workers now to carry out programs which, a decade before,
would have appeared well-nigh impossible tasks. This has meant
that many more tests can now be administered to much larger 
groups and with far better analysis of the resulting data than 
was hitherto possible. 

A major by-product of the resultant extension (both hori- 
zontally in terms of the instruments employed and vertically 
in respect to the populations represented) among group-testing 
procedures has been improvement of the examining materials 
themselves through progressive revision and considerable spread 
in their diversity. It is with developments of this sort, especially 
during the last ten or fifteen years, that our discussion of apti- 
tude testing methods will deal. However, several considerations 
of more or less general nature must be studied first and certain 
basic assumptions set forth before attention can be given to 
differential “batteries” or specific individual measures.


THE MEANING OF “APTITUDE” 


The term “aptitude” is often so loosely employed that it 
must be clarified, or at least arbitrarily defined, for our
purposes. The accepted use of this word among students of mental
measurement is substantially that given in Warren’s Dictionary
of Psychology (1934). Aptitude is there defined as “a condition
or set of characteristics regarded as symptomatic of an
individual’s ability to acquire with training some (usually
specified) knowledge, skill, or set of responses, such as the
ability to speak a language, to produce music, etc.” Two points
especially are to be noted in this definition. In the first place,
aptitude is differentiated from skill. Skill is the ability to
perform some given set of responses at a given time: aptitude is
the ability to acquire skill under appropriate conditions. In
the second place, the definition does not involve any assumption
through the use of the term “symptomatic” as to whether
aptitudes are acquired or innate. Although the latter problem
is important theoretically and has undoubted practical
implications as well, it is not especially relevant to our present
purposes. For these, we shall regard “aptitude” as describing an
individual’s current potentialities to acquire various knowledges
and skills, regardless of the original source of those
potentialities.

In 1934 the present writer offered the following colloquial 
definition: “A successful aptitude test reliably measures
qualities essential in successful future performance, by sampling
previously acquired skills associated or antecedent to those
qualities, but without introducing elements which can only be
acquired from the proposed future study. Such examinations
attempt to forecast subsequent progress by evaluating its
known precursors. The child is still father of the man.”
(Crawford, 1934, p. 15.) This usage is essentially the same as that
later adopted by Bingham in his valuable review of aptitudes
and their measurement, in which he states: “We want the facts
about a person’s aptitude as they are at present: characteristics
now indicative of his future potentialities. Whether he
was born that way, or acquired certain enduring dispositions
in his earliest infancy, or matured under circumstances which
have radically altered his original capacities is, to be sure, a
question . . . of great theoretical interest . . . but it is of
little practical moment to the individual himself at a time when
he has already reached the stage of educational and occupational
planning.” (Bingham, 1937, p. 17.)

Within the field embraced by these definitions, the present 
undertaking is restricted to a still more limited area which, for 
convenience, may be called that of measuring educational apti- 
tudes. We are here concerned initially with individuals’ abili- 
ties to acquire, by whatever means, knowledge and skills de- 
manded for specific curricula of schools and colleges. The na- 
ture of such further limitation may be exemplified by indi- 
cating that appraisal of the ability to acquire certain trade 
skills in clerical or factory work is not the sort of aptitude here 
considered. For example, among many others which might be 
cited, an article on selection of employees for the textile in- 
dustry (Bauer, 1939) deals with practical measurement of 
“work capacity” and involves such indices of promise for train- 
ing in that field as knot-tying by hand or with a knotting ma- 
chine, the visual sorting of threads, tactual sorting of fabrics, 
etc. This is not the kind of measure which we shall consider; yet
an interesting parallel between vocational and educational
instruments does exist in methods of approach or testing
technique. Moreover, their respective aims eventually overlap, or
even merge, as education ultimately leads to one or another
vocation.
BASIC ASSUMPTIONS


The foregoing conception of educational aptitudes implies 
two premises which, though apparently sound, have not ob- 
tained general recognition. The first is that certain subjects of 
study differ from others in nature of the mental processes which 
they require; the second is that a substantial proportion of
individuals differ within themselves (i.e., not only among others
of comparable age and training) in their relative command of
these mental processes. As Seashore maintains, “Each individual
has a sort of personal matrix for the retention and utilization
of new achievements peculiar to himself.” (Seashore, 1938,
p. 851.) More explicitly, Packard has stated: “Human behavior
may be represented as a constellation of innumerable
factors. Through the scientific process of sampling, certain of
these factors may be combined into composites, termed aptitudes,
which may indicate potentiality for particular types of
education or training. The idea that everyone is fitted for
some one occupation has been definitely exploded by the many
investigations of psychology. When characteristics of any
individual are studied, it is found that there are some in which
he excels, some in which he has average capacity, and some in
which he has little or no capacity.” (Packard, 1938, p. 91.)
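
The distinction between differing from others and differing within oneself can be made concrete with a small sketch. It is offered only as an illustration under stated assumptions and is not a procedure taken from the text. Each score is assumed to be already expressed on a common standard scale, and the three area names and all values are hypothetical. The function subtracts a person's own average from each of his scores, so that what remains is his relative pattern of strengths.

```python
from statistics import mean


def relative_profile(standard_scores):
    """Return each score's deviation from the person's own average.

    Assumption: scores are already on one common standard scale, so that
    differences between areas are meaningful.
    """
    own_mean = mean(standard_scores.values())
    return {area: score - own_mean for area, score in standard_scores.items()}


# Hypothetical standard scores for two students with the same overall average.
student_x = {"verbal": 65, "quantitative": 50, "spatial": 35}
student_y = {"verbal": 50, "quantitative": 50, "spatial": 50}

profile_x = relative_profile(student_x)  # uneven: +15, 0, -15
profile_y = relative_profile(student_y)  # flat: all zeros
```

Both hypothetical students stand at the same general level when compared with others, yet only the first shows the kind of internal variation that the second premise describes.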

It may be added that our two foregoing assumptions appear 
wholly reconcilable with the leading “factor” theories, variously
set forth as schematic explanations of individual differences in
more or less specific aptitudes or mental abilities. In other
words, recognition of these significant personal variations, and
their practical import for guidance purposes, may be accepted
without commitment respecting any one theoretical concept as
to what their origin, psychological nature or mental
relationship may be.


VOCATIONAL AND EDUCATIONAL OBJECTIVES 

We may take as analogous to so-called “achievement measures”
in education (as represented, for example, by the well-known
Cooperative Test Service products and other indices of
acquired knowledge discussed in Chapter IV) various “trade
tests” designed to measure an individual’s attained skill in a
particular job or his knowledge of operational processes. The
corresponding parallel to educational aptitude tests is afforded
by instruments designed to measure the (as yet) untrained
individual’s potentiality for acquiring vocational skills—that is,
his “teachability” or apprenticeship promise. It is somewhat
startling to realize that capacity to learn has thus far received 
greater attention from various business and industrial fields 
(largely because of its effect there upon operational costs 
and profits) than from education, whose primary concern it 
should be. 

Impelled by the problems of excessive turnover, waste of 
material and occupational hazard, industries have long been 
actively concerned with the effective selection and adjustment 
of their employees. Certain work-aptitude tests have proved an 
important means of facilitating these efforts toward economy 
and efficiency in personal relations. Under the recent added 
stimulus of “all-out” war production, such measures—varied
and extensive in nature—have been more widely utilized in 
this and other countries than ever before, by industries and the 
armed forces alike. Moreover, as compared with educational 
institutions, the occupational world has developed superior job 
analyses, with consequent better criteria of success; and has 
quite naturally devoted greater attention to ad hoc training in 
preparation for subsequent performance. The nature of “learning”
in business or industry, the Army or Navy, makes such
specificity of training more readily possible than it is (or
indeed should be) within liberal-arts and science curricula. It may
be noted, however, that academic counterparts of industrial 
aptitude testing, and even “job analyses” of a sort, at last are 
gaining considerable recognition even from educators. 


The “Miniature-Test” Technique


As one means of re [...] industry has made effective use of
[...] ble period) attempts to [...] ask by presenting “learn-
[...] on a small scale, samplings of [...] in that function
demands. A well-known instance of this technique [...] for the
engine lathe) utilizes a [...] ating through worm gears,
[...] 29, p. 68.) Performance on this instrument represents in
“miniature” [...] manual dexterity which are important in
learning, for example, how to operate a lathe. Despite relative
simplicity of the device itself, differences in performance
thereon have proved
valuable in measuring essential teachability and hence in se- 
lecting apprentices for that and similar occupations. Its essen- 
tial principles have since been well used in a modified form to 
measure relative aptitude of prospective gun-layers or -pointers 
in the Navy and gunners in the Air Forces. 

More elaborate tests of this general type, such as those de- 
vised by Slocombe (1930) for motormen, by Viteles (1932) for 
electric substation operators, and later along different lines by 
other experimenters, likewise simulate typical problems and 
complex situations requiring judgment, decision and prompt- 
ness in reacting correctly to the demands which operators in 
certain vocations must regularly meet. Such procedures have 
served to reduce both property damage and personnel turn- 
over by eliminating “accident-prone” or otherwise inept appli- 
cants for industrial employment in various fields. As we shall 
later see, this fundamental miniature-testing technique—using 
of course quite different materials—can also be applied to re- 
duce wastage and “turnover” in education. 


Aptitude Testing in the Combat Services 


Great progress has recently been made in the development
of even more numerous and specialized aptitude tests for the
armed forces.² Such materials are now closely guarded, as they
obviously must be, so that these may operate fairly for all
applicants, and for military reasons as well. Yet it is known that
a wide range of both physical and mental examinations have
been used in this and other countries to represent, for example,
the problems of operation under extraordinary strains placed
upon the human body in aerial combat and mechanized warfare;
of night-vision or dark-adaptation; of submarine tactics,
etc. One of the most valuable early instruments of this nature
is the frequently mentioned Link Trainer for blind flying.
Certain of these specialized devices go back to developments in
the last war (especially at aviation ground schools), and those
in turn represent an elaboration of the miniature-test technique
used even earlier in selecting workers for industry. Others have
more recently been developed in connection with modern radar
and other electronic miracles. It is, however, clearly outside
the scope of this manual on educational prognosis further to
discuss instruments of this nature, however vital they are for
specialized purposes.

¹ A great many other instruments of this general type might be
described if space permitted. Specific references in the
literature on this subject are too numerous to mention but may be
found in the Psychological Index, or in any issue of
Psychological Abstracts, under the section on “Industrial and Per[...]

² Among many other references which might be cited at this point
are the following: [...] (1942), Bingham (1942), Louttit (1942),
[...] Bellows (1942), Psychology and the War (1942), Miles (1945),
Miles (1945a) and Viteles (1945).


EDUCATIONAL APTITUDE MEASURES 


Educational aptitude tests may be regarded as in one sense a
counterpart of the “task-simulation” measures we have just 
been discussing. They have fundamentally the same objective 
—to sample specific abilities requisite for learning, rather than 
knowledge per se which has already been acquired—and differ 
chiefly in the kind of future achievement which they aim to pre- 
dict. The basic principles of test construction, validation and 
application are similar in both cases, and considerable over-
lapping of content may be expected between them in some in-
stances. For example, a spatial visualizing test, useful in deter-
mining apprenticeship promise for a number of occupations,
may also prove distinctly important in measuring students’ 
aptitude for engineering or architectural courses and for cer- 
tain aspects of military or naval science. It is indeed no secret 
that, throughout various screening tests given by the armed 
forces, spatial measures of one form or another taken from 
educational batteries have consistently demonstrated their value 
for important war-selective purposes. Conversely, scores on 
this particular sort of test seem quite unrelated to either past or 
future performance in such fields as English, foreign languages, 
history, economics, law or other verbal-linguistic fields. There 
is perhaps no better cursory example of appraising differential 
rather than general promise. 

Hugo Münsterberg (1913), the distinguished German psy- 
chologist wisely imported by Harvard University early in this 
century, seems rather to have startled his professional col- 
leagues by transplanting a few vigorous seedlings from the 
cold-frame of “pure” laboratory experimentation into the 
ample garden of useful reality. His pioneer efforts thirty-odd
years ago represent, at least in this country, the first trial of
aptitude tests in selecting employees for training; their value
for this purpose has ever since been demonstrated over an
increasingly wide range.

The types of specific ability, previous training, and differen-
tial promise, stressed as “admission requirements” to a factory
or office, for obvious reasons differ in nature and particularity 
from those emphasized in academic circles. Manual dexterity, 
fine eye-hand coordination, exactness in arithmetical problems 
and the like may all genuinely measure aptitude for certain 
trades, but they represent no basic intellectual quality. It is the 
latter type of mental power, in the sense of relative promise 
for one or another field of study, with which this volume is 
primarily concerned. Consequently, our emphasis and applica- 
tion of the term “aptitude” will be pointed toward differential 
readiness-to-learn, especially at the college-preparatory and 
higher academic levels. 


Primary Objectives of This Survey 


Even though similar occupational or professional labels are 
necessarily employed to designate some variant types of in- 
dividual promise, it is their educational rather than their vo- 
cational aspects which we have in mind. When engineering, 
legal or medical aptitudes are spoken of, for instance, the refer- 
ence is to students’ anticipated performance in the respective 
professional schools, rather than to direct prediction of their 
ultimate success as engineers, lawyers or doctors. Yet it is nor- 
mally to be expected that satisfactory completion of such cur- 
ricula is a precursor, if not indeed an absolute prerequisite, to 
these professional careers. 

On the other hand, dramatic, musical or artistic talents have 
a primarily professional connotation; one tends to associate
these rather more with career possibilities than necessarily with 
achievement in the formal academic sense. Certainly the Bache- 
lor's degree in Drama, Fine Arts or Music is no requisite to 
licensure and practice; personal excellence, with or without 
benefit of faculty, is the basic criterion by which individual 
performance is ultimately judged in these several arts.

More widely applicable proficiencies (such as verbal and
linguistic facility, mathematical and scientific promise, for ex- 
ample) will no doubt always maintain a central place in edu- 
cational aptitude testing. While the various other fields men- 
tioned above will in due course receive considerable attention 
throughout later sections, the foundations of learning just 
mentioned deserve priority. Earlier formal achievement in these 
areas, when it can be reliably measured, is of major importance; 
yet the instances are not few where even a technically excellent 
content test is inappropriate for the appraisal of scholastic
promise. This is because such tests, like “essay” examinations
covering a particular subject, assume reasonable parity of ex-
posure to earlier opportunities for learning in accordance with
the usual pattern.


RESTRICTED SCOPE OF ACHIEVEMENT TESTS

Despite the best of intentions and the use of well-validated
measures, achievement testing programs or other efforts to appraise
acquired knowledge in various fields may sometimes prove in-
adequate or actually misleading. However accurately the data
so obtained reflect the individual's retention of what he has al-
ready studied, they often fail to suggest his promise for other
fields to which he has not been exposed. His strongest intel-
lectual powers may be latent and as yet uncultivated, for
lack of recognition or opportunity. In such cases, content
tests can throw little light upon what he might have acquired
under other educational circumstances and from other courses
than those actually experienced, or what path he should thence-
forth follow. The histories of educational and vocational guid-
ance, along with many disti


studded with cases in which 
just such latent talents 
handicap which plagues n 

Variant curricular emp 
regions, or types of scho 
of these differences, the 


needs of students and their guidance
officers. To complete a thorough inventory of educable promise,
indices are also required of ez 


strable lack of it, after due tr 
may be allergic to physics and 


be established without ex-
posure. Tests definitely constructed to gauge achievement in
any subject are manifestly inappropriate for the student not
previously introduced thereto—yet the happy infection might
“catch” if given a chance! Nor can the desired comprehensive
inventory be taken by mere “intelligence” tests, which,
even if fair for all, are too generalized for directional signifi-
cance.


The relative values and limitations of both specialized
achievement measures and nonspecific “omnibus” intelligence
tests will later be discussed at some length. It must be recog- 
nized that a considerable degree of overlapping exists among 
these and aptitude indices of various types. The several labels 
denoting relative emphasis upon one or another means of ap- 
praising individual differences should be regarded as symbolic 
rather than clearly definitive. 


HOW LEARNING “CONDITIONS” APTITUDE 


Any educational aptitude measure, particularly at the
academic levels here especially considered, is dependent to a
considerable degree upon most individuals’ previous scholastic
achievement. Not only must they be able to read and understand
printed (even though supplemented by oral) directions; they
must also have at least some elementary knowledge of the basic
concepts underlying certain of the major fields which these tests
represent. The latter do not operate in a mental vacuum; indeed
they reflect variable impacts of learning, in different fields,
upon more or less correlative aspects of mental growth. This
cannot be appraised directly without recourse to the many tools
which have forged that growth. Sign language is hardly a
practical test of verbal comprehension; nor can one measure the
abstract “mathematical aptitude” of anyone to whom a and b,
x and y are but widely separated parts of the alphabet.

Some students ordered to duty as “trainees” in the recent
Army or Navy College Programs (which were essentially tech-
nological in nature), though at least high school graduates, had
little or no mathematical basis for the courses they were ex-
pected to pursue; hence they raised serious problems in teach-
ing. An instructor at Yale, to whom a group of service fresh-
men found quite deficient in this respect had been assigned for
special attention, undertook a so-called “refresher” course, be-
ginning with elementary algebra. He put on the blackboard, as
a starter, the equation: (x + y)(x − y) = x² − y², and re-
quested its demonstration. One bewildered boy in uniform then
asked (and this instance is literal, not fictitious), “Sir, how can
you add or subtract letters—those aren’t numbers!” That
trainee had a fine earlier service record, was keen and funda-
mentally intelligent, but somehow had by-passed the normal
minimum of secondary school mathematics.

Even an aptitude test, which presumes introduction at least
to the rudiments in this field and some acquaintance with letter
symbols, would prove of little use in such an extreme case; and a
measure of formal achievement in mathematics simply ridicu-
lous. A considerable intermediate range exists between the slight
knowledge of mathematics necessary to mere understanding of 
aptitude type questions and the more sophisticated particular 
requirements made by achievement tests. 


Further Consideration of Aptitude Characteristics 


It is toward this intermediate range, within different major
areas of study, that measures of the sort to be considered in this
work are primarily directed. The appraisal of linguistic facility
requires an understanding of such terms as plural, singular;
past, present and future tense; noun and adjective. Mechanical
ingenuity can scarcely be tested without reference to pulleys,
gears and levers. Hence an educational aptitude battery appro-
priate for use with high school or college students must,
throughout its entire extent, draw in some degree upon various
phases of their educational past in attempting to forecast the
most likely direction of their educational future. Moreover, to
cite Packard again: “There is no one test that can be considered
as adequately designed to measure all the factors involved in
any one aptitude. For example, mechanical aptitude seems to
be a composite of varying degrees of difficulty that may be ap-
plied to the various school levels and probably includes intelli-
gence, academic achievement, grade level, mechanical ingenuity,
coordination, manipulation, spatial perception, construction,
sensitivity and dexterity.” (Packard, 1938, p. 98.)

Whether aptitudes de[…]velop according to environmental cir-
cumstances and even casual accidents which “condition” […]
their joint effect upon resulta[…]. Yet it seems indisputable
that […] evidence of differential promise (relative readiness
to learn in certain fields as contrasted with others), when
properly […], […]
found by the time of college matriculation or earlier. At that
stage, neither these differential abilities nor the tests designed 
to measure them actually represent “pure” aptitudes. Even if
the latter derive (as few writers maintain) from a particular in- 
herited complex, they soon become overlaid and to some extent 
obscured by the standardizing effect of current educational 
processes throughout all primary and most of the secondary 
grades. 

Consequently, the fact that marked individual differences in 
relative learning capacity for disparate areas manifest them- 
selves at all under these conditions—even for a minority of 
students—points to their importance wherever they definitely 
appear. That they do so in many cases, at the time of college 
preparation or entrance, cannot be denied. At these educational 
levels, attained after years of schooling, the process of meas- 
uring variances within the individual must, however, rely in 
considerable degree upon some way of sampling his accom- 
plishments to date, with particular reference to their future 
applicability. Hence tests of aptitude differ from those of 
achievement chiefly through emphasis by the former upon ready 
adaptation of past learning to the solution of new problems.


EDUCATIONAL GUIDANCE THE FIRST STEP TOWARD
VOCATIONAL ADJUSTMENT

Much has already been published, and many investigations
conducted, concerning the general subject of so-called aptitudes
and their measurement by objective means. Considerable con-
fusion still exists regarding the nature of these individual dif-
ferences of relative promise for certain types of study or train-
ing—e.g., mechanical, linguistic, musical, scientific, etc. Part of
this confusion arises from an overlapping of educational with
more specifically vocational objectives and counseling problems;
part from analogous overlapping among diagnostic measures
of achievement with those of aptitude. Thus the student or his
academic adviser, influenced by the “round-peg and square-
hole” bogeyman of vocational guidance, may unduly or pre-
maturely emphasize the choice of a career, at a time when em-
phasis should rather be placed upon curricular planning ap-
propriate to his particular sort of readiness-to-learn.

As an example, we may compare vocational aptitude for ac-
counting with educational aptitude for mathematical studies.
Accountancy is an admirable profession, offering excellent op-
portunities for those possessing the attributes essential to suc-
cess therein. Among them, no doubt, is at least one phase of
mathematical ability, though highest progress demands other 
qualities as well. The same basic learning power and the facility 
in dealing with figures and quantitative concepts which con-
tribute to qualification as a certified public accountant might
find equal scope in, let us say, statistics, astronomy, actuarial 
work, or certain operations in physical science. 

An important first point is to identify and cultivate what 
may loosely be called “mathematical aptitude in general.”
Whatever utilization thereof is subsequently made after further
training in a particular field (whether pure or applied) is
surely less important than is the prompt recognition and devel-
opment of mathematical learning capacity per se. The individ-
ual's vocational guidance in this and other areas will probably
take care of itself in time if intelligent educational guidance has
laid the proper foundation for subsequent educational growth.
Choice of the particular occupation where any such basic talent
may be employed to distinct advantage should not be prema-
turely or narrowly determined while the educative process is
still in flux. Incidentally, the term often used in vocational guid-
ance, “to best advantage” (as if aptitude had only one best
outlet), suggests a foreordained specificity which the writer
considers fantastic.

There is also some danger that a combination of primarily
subjective circumstances and opinion, stemming from family or
social influences, may too soon press an individual toward cer- 
tain courses of study for which his mental make-up is not we 


DISTINCTIONS AND CORRESPONDENCES BETWEEN
APTITUDE AND ACHIEVEMENT TESTS

These methods of approach to individual measurement may
be compared in other ways as well. Both aptitude and achieve-
ment tests are concerned with more or less separate or specific
fields of endeavor, rather than with a general average of scho-
lastic work or promise. Both are customarily validated by cor-
relating their evidence (“scores”) with other measures of sub-
sequent attainment in particular types of study; yet they differ
in respective purposes. An achievement test is customarily ad-
ministered after a student has already pursued some particu-
lar subject, to see how much he has learned about it. An apti-
tude test is intended to give a preview of how well he is likely to
do even along lines yet untried. By comparison, the achievement
test is in a sense backward-looking and the aptitude test more
forward-looking. This is a rough-and-ready statement, since
measures of formal knowledge (when appropriate to the
student's past experience in reference to his future program)
should also be duly recognized as forward-looking.
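
The validation procedure mentioned above, in which test scores are correlated with later attainment, may be illustrated by a short computation. The sketch below is ours and not part of the original study: the function name pearson_r and the six pairs of scores and grades are hypothetical, chosen only to show how a product-moment coefficient would be obtained from such paired records.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations, and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined when scores do not vary")
    return sxy / sqrt(sxx * syy)

# Hypothetical records: test scores and later course grades for six students
test_scores = [52, 61, 47, 70, 58, 65]
later_grades = [71, 78, 64, 88, 70, 83]

validity = pearson_r(test_scores, later_grades)
print(round(validity, 2))
```

A coefficient near +1 would indicate that students ranking high on the test also tended to rank high in the later course; with so few cases, any such figure should of course be read with caution.
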
College-entrance examinations and other achievement indices
have for some years been extensively employed for placement
(i.e., differentially prognostic) uses. As already noted, the
validity of all such tests, of whatever nature, is largely depend-
ent upon the individual's earlier formal preparation to take
them at an appointed time, after completion of more or less
definitely established courses. Their content is largely deter-
mined by that of standard curricula. Hence they must remain
backward-looking to a considerable degree by reason of their
usual specificity. This emphasizes already-acquired knowledge
rather than readiness to acquire it “from scratch” (except
where the two conditions merge through cumulative study of the
same, or a correlative, subject). Yet aptitude tests are also
retroactive, in so far as they must sample past learning even
for future reference in terms of its novel applicability. The
forward projection of their evidence, however, is comparatively
broad in scope. The one procedure is basically concerned with
specific measurement of what has occurred, while the other
(though also drawing upon previous experience) does so with
the aim of predicting what may occur, even in fields new to the
student.


Complexity of Educational Aptitudes 


So-called aptitudes or talents may be relatively simple or
extremely complex. Few of those we shall consider for educa-
tional guidance purposes are “unitary,” although often popu-
larly referred to as if they were. Most of them represent a com-
posite of several capacities. Mathematical aptitude, engineer-
ing aptitude, medical or legal aptitude, talent for art, music or
teaching, are terms which serve as convenient—though inexact
—labels to indicate readiness-to-learn within the respective
areas designated. Their employment later, in chapter headings
for example, is by no means intended to suggest that the vari-
ous so-called aptitudes are either psychologically “pure” or
unconditioned by earlier learning.

It is our contention indeed (as will later appear when “pri-
mary” mental traits are discussed) that intellectual abilities of
the sort referred to above have at college, and even high school,
levels become too complex and variously interwoven ever again
to be wholly disentangled. An attempt to segregate their basic
components can, it is true, be made through employment of
quite simple (and therefore functionally rather “pure”) tests.
But in this process the mental tasks assigned, perforce equally
simple, become unrealistic for practical guidance purposes.
The mind does not operate in any pragmatic higher-learning
situation by separate use of theoretical entities, but through
their composite teamed application to the entire complicated
task. An elementary illustration of the Gestalt concept in psy-
chology is a picket fence. This is composed of posts and a rail-
ing, but the mind grasps this “configuration” as a whole (at-
tractive, orderly, drab or hideous, depending upon color and
state of repair), which then becomes distinctly more than the
sum of its separate parts.

Such is also the case […] “configurations” or
syntheses, especially at t[…] which we are […]. A
class assignment in history or jurisprudence, for example, cannot
be mastered without […] perception, judgment, memory,
deduction or other […] separable “factors.” […] or
descriptive geometry […] mathematical, spatial […]
upon these abilities […] “verbal” nature. Hence both
[…] measures in […] those of the aptitude type
[…] to a limited […].


THE PARTICULAR FUNCTION OF “APTITUDE” TESTS


Aptitude tests can serve a real and […].
Educational (or vocational) tasks, […]
difficult and important, place different emphases […]
fundamental abilities. To repeat: “aptitude” as used in this
and later chapters represents a significant, more or less special-
ized, combination among such abilities, constituting one or an-
other “talent,” and denoting relative promise for its further
development. Almost equally significant in many cases is the
recognition of ineptitude for a contemplated course of study.
The measurement of these different combinations, variously 
stressing readiness-to-learn in identifiable and disparate fields,
is the aim of aptitude tests subsequently described. Perform- 
ance so appraised reflects individual variances of considerable 
importance in the guidance of students and their choice of ap-
propriate scholastic concentration. These represent gross
(rather than refined or highly specialized) differences in prom- 
ise for certain broad areas which in everyday parlance are often 
too glibly distinguished from each other, despite certain posi- 
tive inter-relationships among them. 

With reference to such aptitudes and their measurement,
our attention will be directed chiefly toward these gross differ-
ences throughout major educational divisions rather than to
finer ones (such as, for example, between biology and chem-
istry; electrical and mechanical engineering; or English and
history). Within each of the broad areas to be considered,
significant curricular subdivisions also exist, but it is question-
able whether corresponding “sub-aptitudes” can ever be dis-
tinguished with assurance. Moreover, before the student can
choose between this or that special branch, he must earlier have
started up one or another main division beyond which the sev-
eral branches lead. It is with these preliminary choices that we
are concerned, since later election of specialties within the main
areas is obviously beyond the scope of a basic guidance pro-
gram. While, for example, it should be possible to identify
artistic talent worthy of cultivation and encouragement, we can
hardly expect by prognostic testing to predict the particular
medium or technique—such as portrait painting, mural design
or dry point—which eventually may establish the individual's
success as an artist.


Importance of the Main Stream 


Throughout consideration of these general aptitudes for
study in many different areas, we are faced with a paradox. We
know that each of these, when carried to the higher levels of
specialization, may be subdivided. Some doctors will develop a
special flair for surgery; some scientists for laboratory re-
search; some lawyers for trial work. Yet, as earlier noted, we
often think and talk of medical, scientific or legal aptitude as if
each were unitary and undifferentiated among these respective
professional groups as a whole—which is certainly not […].
Nevertheless, before anyone can reach the point of […] a
specialty he must meet the beginning, common requirements of
any such entire field. That makes it not only justifiable but nec-
essary to treat the aptitudes in question as if they were […]
less complex than we know them to be. We should […]
to measure those simpler, initial aspects of readiness-to-learn at
the levels of entrance to various broad channels, which in turn
lead upward to a series of successively higher branches, all parts
of a total system. An explorer starting up the Amazon may be
uncertain as to which tributary he may later ascend or where his
course may eventually take him; but at least he knows it will
not bring him to the headwaters of the Nile. Aptitude testing
provides one way of guiding youth, at different stages, toward
appropriate wide rivers of learning and sometimes even up to
one or another of their main tributaries—for example, toward
science first and then perhaps toward engineering, geology,
medicine, or some other major branch.


[…]ENTS

[…] we consider educational aptitu[des]
and their measurement with respect to such areas as mathemati-
cal, scientific, mechanical, etc., it will appear […]
[…]rent aspects. As already emphasized, […]
[…]ces to meet the educational requ[…]
[…] major, though a distinctive combination
of several traits may do so. For practical purposes we may con-
sider these various aptitudes, […] signposts pointin[g]
[…] appear. As Thurstone […]: “[…]
we can talk about musical ability […]
psychologists, we believe that different
abilities are probably involved in a good voice, absolute pitch,
[…] counterpoint, and orchestration, me-
lodic memory, ease in memorizing at the piano, ability in musi-
cal interpretation, and so on. How many […] important functional
unities might there be in this domain? They are surely not all
completely independent.” (Thurstone, 1940, p. 209.)

Upon such closer examination we shall also find that, taken
as a whole, each educable capacity represents a continuum, run-
ning from simple to progressively more complex and difficult
functions. Aptitude tests of the sort herein discussed are es-
sentially measures of students’ facility in directing their past
experience toward future problems. In the college-preparatory 
and freshman years, tests with the intellectual content necessary 
in forecasting promise for higher educational fields must sample 
learning already acquired. By the time they are administered 
to such students, the latter have passed far beyond the stage at 
which “pure” aptitudes or predispositions (if indeed they were 
ever realities) might still conceivably persist, unaffected by 
conditions of environment, study and other activities.


SUMMARY 

This investigation deals primarily with prognostic devices in- 
tended to measure relative promise even for subjects or areas of 
study not previously attempted. Obviously, these instruments 
cannot be used with individuals whose education and learning 
ability in general are inadequate to cope with the “miniature” 
learning problems or operational demands they make. A legal 
aptitude test should be devised as appropriate for the legally 
naive mind—i.e., it should measure mental qualities important 
for success in law school but not knowledge of the law itself.
Yet, to prove effective as an aid in guidance or selection, it must
test those basic qualities on the advanced level already attained
by college men and women of superior ability. Likewise, measur-
ing relative promise for the study of engineering in college in-
volves, among other factors, projection toward new mathemati-
cal exercises (as yet unfamiliar to the student) of his facility
in quantitative thinking as developed at high school. Yet that
cannot well be undertaken unless a reasonable amount of such
thinking and training has already taken place.
Aptitude testing of one sort or another runs the whole educa-
tional gamut. The abilities just mentioned are clearly of a dif-
ferent order from, let us say, either verbal facility or “number
sense” as measured in the sixth grade. To quote Thurstone
again: “While number problems may be routine tasks for an
adult, they may be inductive tasks for a child.” (Idem, p. 210.)
Though later sections of this book necessarily touch upon cer-
tain of the simpler tests (because reference to them in educa-
tional literature is extensive), we shall confine ourselves chiefly
to more advanced—and therefore usually more complex—
materials. The general type of instrument in question will be
illustrated specifically, through presentation of certain testing
materials developed for prognosis in various curricular fields,
undergraduate and professional. Naturally such illustrations
can represent, within practicable limits, only a fractional
sampling of the extensive materials (good, bad and “middlin’”)
which have been developed throughout recent years.
Before moving toward our main objective—the attempted
evaluation of objective tests and other measures relating to
particular areas of study—a “base camp” must be established
and the terrain, methods of procedure, alternative approaches,
etc. therefrom explored. This necessitates attention in a general
sense to numerous batteries or tests—e.g., of achievement, ap-
titude, unitary traits or primary mental abilities and of the
overlapping relationships among them to which brief reference
has already been made. Moreover, the consideration of any such
differential measures (whatever their nature may be) in a […]-
cal sense requires some familiarity with basic statistical con-
cepts, methods and nomenclature. Therefore, a cursory […]
of those essentials will be presented in the next chapter.
To some readers this will seem elementary, and to others, quite
—though we trust not utterly—specialized.
As stated in the Preface, this introductory chapter
and those which immediately follow (constituting Part I of an
attempt to evaluate differential measures of relative “educabil-
ity” at the higher levels) are now separately published. They
present background materials which we trust have considerable
independent value as they stand. Moreover, these are essential
to the later consideration of Parts II and III, which deal with
forecasting promise for particular fields of undergraduate,
graduate and professional specialization. These subsequent
Parts are well in process and are scheduled for serial publica-
tion as soon as they can be completed.




CHAPTER II 


THE MEASUREMENT OF EDUCATIONAL 
PERFORMANCE AND BASIC STATISTICAL 
PRINCIPLES 


IN DISCUSSING later the value of aptitude or other tests
for practical guidance purposes, frequent reference will
necessarily be made to such matters as means; distribution
(“spread”) of scores; correspondence or divergence among
various measures; their reliability and validity (as respectively
estimated through “correlations” either within themselves or
with external criteria); errors of estimate, levels of confidence
and other somewhat esoteric matters. Many student counselors
to whom the data later presented may be of interest are quite
familiar with these statistical terms and the concepts underly-
ing them, and others much less so. In order to facilitate under-
standing by the latter, an attempt will now be made to outline
essential methods of analysis and the basic principles or as-
sumptions from which they stem.

The outline which follows has been deliberately simplified in
behalf of readers with little or no statistical training. Persons
skilled in the formulas and usage of central tendency, sigma
values, critical ratios, simple and multiple correlation methods
and at least the rudiments of factorial analysis may well ignore
it. For those less well acquainted with the latter technique,
comments thereon will be expanded in Chapter VI, where
Thurstone's Primary Mental Abilities Battery is considered in some
detail. Accordingly, the ensuing expository comments make no
pretensions to a technically comprehensive or full treatment of
these several complex topics. Subsequent references, however,
point the way to opportunities for their additional study far
beyond the limited compass of this chapter.

One excellent presentation, more thorough than that undertaken
here and yet not too difficult for a nonmathematical mind
to follow, will be found in Anastasi's Differential Psychology
(1937), a book which educational advisers who wish to be
reasonably well informed on individual testing possibilities should
read. Statistics in Psychology and Education by Garrett
(1941) and Psychometric Methods by Guilford (1936) are
frequently used, standard texts. Peters and Van Voorhis, in
Statistical Procedures and Their Mathematical Bases (1940),
first offer a valuable review of the calculus and then proceed to
discuss various statistical problems or methods in a refreshingly
original and illuminating way. Lindquist's A First Course in
Statistics (1941) is an excellent introductory textbook. Still
another, which requires little mathematical background to
appreciate its admirable lucidity, is Walker's (1943) Elementary
Statistical Methods. These few suggestions of course should not
be regarded as in any sense definitive. Additional references of a
more specialized nature are noted elsewhere in this chapter.


A. SAMPLING PROBLEMS, NORMS AND EXAMINATION
METHODS


First of all the counselor or test user should bear in mind that
he is dealing with relative, not absolute, units; and is doing so
by a method of approximation based upon a series of more or
less representative samplings. Thus, when a test or examination
of any sort is devised, the questions appraise a candidate's
knowledge or aptitude by employing a necessarily limited number
of questions, out of th[...] asked if time only permitted. [...]
graded for the market, samp[...] growers' produce. Obviously
[...]amined; classification depen[...] knowledge or the capacity
to acquire it.

To provide an exact measure of each person's equipment
bearing upon even one complicated subject or skill would require
so extensive a number of questions as to be quite inexpedient.
Repeating this theoretical (but scarcely possible) procedure
over and over again for every function to be tested is more
obviously impracticable. Therefore it is necessary in educational
measurement to determine how well any sample represents all
the appropriate materials from which it has been drawn. It is
equally important to determine how representative is the
population of people actually tested, in relation to all individuals
who might have been tested. These two major types of
sampling—one dealing with matériel employed in
measurement and the other with personnel subjected to
measurement—are equally important and difficult to control
throughout all phases of that rather inexact science somewhat
glorified by the title “psychometrics.”

As will later be more fully discussed, one method of “testing
the test itself” depends upon administering successive
item-samplings to the same (or some closely similar) group and then
comparing the obtained results. Many difficulties complicate
this procedure. For one, sheer human variability often enters in
to affect individual responses; the “same person” or the “same
group” actually is never the same twice, in a total sense. Every
previous exposure, to say nothing of conditions (growth, health,
interest, et al.) extraneous to the test itself, more or less affects
future response. Hence, it is desirable to study successive
“population samples,” not only to maintain the continuing
effectiveness of a test, but to bring about its further improvement if
possible.


Range of Sampling 

It should be noted in this connection that increased homogeneity,
either of examination material or of the groups tested,
reduces the extent (size or range) of samplings necessary for
evaluation. Ten eggs out of a few dozen previously graded by
a poultry farmer as “large, extra-fine” may yield a more reliable
index of relative quality in his output than a hundred
selected at random from all gathered on a particular day. While
these remarks may seem trite, the principle they express is
important. Yet it is frequently overlooked in comparing or
reporting mental-measurement data. One clear and eminently
sensible consideration of sampling methods, norms, etc., in more
precise and formal terms than this brief survey can possibly
include, will be found in a cogent article by Lindquist (1940a)
and more fully in the introductory chapters of his later
publication on statistical methods (1940b).
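
The egg illustration can be put in rough numerical terms through the standard error of a mean (the spread of the sample divided by the square root of its size), a formula not otherwise introduced in this chapter. What follows is a minimal sketch only: the two spreads are invented for illustration and are not data from any survey cited here.

```python
import math


def standard_error(sd: float, n: int) -> float:
    """Standard error of a sample mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)


# Hypothetical spreads (in grams), assumed purely for illustration.
graded_se = standard_error(sd=1.0, n=10)      # ten eggs from a pre-graded lot
ungraded_se = standard_error(sd=5.0, n=100)   # a hundred eggs taken at random

print(f"pre-graded sample of 10:  {graded_se:.3f}")
print(f"ungraded sample of 100:   {ungraded_se:.3f}")
```

Under these assumed spreads the small homogeneous sample gives the steadier average; with other spreads the comparison could of course go the other way.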

Wide differences in the range of ability among students taking
various tests or subjects present a major problem in the
interpretation of resulting measurements. So-called “norms,”
however established, are relative, not absolute. This fact also is
often disregarded in the development of basic reference standards,
including those sometimes dubbed “national.” It is true
that reliability of sampling, other things being equal, tends to
rise with the number of cases represented, either in subject
matter or in geographical and educational areas. However,
mere augmentation of the cases in question does not necessarily
or by any means proportionately enhance the value of consequent
results. Careful selection of the samples, to make them
either for one purpose more homogeneous or for others more
widely representative, is likely to prove of greater significance
than is any indiscriminate piling up of mere numbers. To quote
from Lindquist's article just cited:¹

“One of the most important uses of sampling in education
has been in the establishment of norms on tests of educational
achievement. In general, we have judged the dependability of
these norms by the number of pupils involved. In fact, this has
usually been the only information concerning the dependability
of the norms that has been provided by the test publishers. To
take just one example from this field, one of the most widely
used of all standardized tests has for years carried norms based
upon the administration of the test in just twenty-four school
systems. The number of pupils involved was quite large—about
ninety-eight hundred—but over four thousand of these pupils
came from just one school system and over three thousand of the
remainder from just six school systems; that is, just seven
schools accounted for over seven thousand of these ninety-eight
hundred cases used. It is possible, of course, that these twenty-four
schools were more representative of nation-wide achievement
than if they had been selected strictly at random, but this
is at least open to question, particularly since seven of these
twenty-four schools were located in the same state. At any rate,
the sample must be considered as consisting of twenty-four cases
rather than of ten thousand cases, and hence as much less
dependable statistically than we have generally believed it to be.
Incidentally, it could be readily demonstrated that by using
available methods of stratified sampling one could establish a
more reliable norm on the basis of just one or two hundred
pupils than one could on the basis of ten thousand pupils taken
from only twenty-four randomly selected schools.” (Lindquist,
1940a, p. 566.)
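
Lindquist's point can be roughly illustrated with the "design effect" approximation used in later survey-sampling practice, 1 + (m − 1)ρ, where m is the average number of pupils per school and ρ the intraclass correlation among pupils of the same school. This is a sketch under stated assumptions, not Lindquist's own computation: the values of ρ below are hypothetical, and the formula presumes schools of equal size, which the systems in his example plainly were not.

```python
def effective_sample_size(n_pupils: int, n_schools: int, rho: float) -> float:
    """Approximate number of independent cases in a clustered sample.

    Uses the equal-cluster design effect 1 + (m - 1) * rho, where m is the
    average cluster size and rho is an ASSUMED intraclass correlation.
    """
    avg_cluster = n_pupils / n_schools
    design_effect = 1 + (avg_cluster - 1) * rho
    return n_pupils / design_effect


# 9,800 pupils in 24 school systems, as in the passage quoted above.
for rho in (0.05, 0.10, 0.20):  # hypothetical degrees of within-school likeness
    n_eff = effective_sample_size(9800, 24, rho)
    print(f"rho = {rho:.2f}: roughly {n_eff:.0f} independent cases")
```

Even a modest assumed likeness among pupils of the same school shrinks the nominal 9,800 cases to a few hundred at most, which is the order of magnitude the quotation suggests.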

National Norms 


One important attempt to establish really national norms of
educational promise (not those merely so designated, with little
or no attention to sampling variances throughout the populations
tested) is represented by Flanagan's (1939) development
of Scaled Scores for Cooperative Achievement Tests. These
scores are determined by reference to a precisely defined original
or basic group—ninth-grade students—of mean age 14.5
years and whose Otis IQ scores averaged 100. By thus selecting
a broad, educationally appropriate population as the standard
norm, Flanagan has given test scores for various fields (wherein
a greater or less degree of “self-selection” obtains) more uniform
meaning than they previously had.²

1. Italics by the present writer.

Toops (1939) has also proposed in detail a somewhat different
approach to essentially the same goal, through careful
selection of a large group—the “Standard Million”—among
students adequately representing certain stipulated educational
levels and geographical distributions. Obviously, either plan
still represents a sampling within our total population; though
one much better defined, and more widely characteristic, than
those from which other “national norms” have previously been
derived.

For example, administration of such well-known instruments
as the Otis or American Council tests, even throughout forty-eight
states, does not of itself guarantee that the resulting
norms are really national, except perhaps in a geographical
sense. To that degree the Literary Digest's surprisingly
erroneous Presidential-election poll of 1936 was national—it
covered the country, but its sampling technique nevertheless
proved to have been grossly unrepresentative. By analogy,
when test results are set forth in percentile terms or any other
method of expressing individual rank relative to certain
predetermined standards, one should consider as essential data what
level and range of ability are represented by the population
from which their reference points were derived. It is usually
advisable to determine specific norms for the particular group
and situation in which some instrument, even though already
“standardized” elsewhere, is to be employed anew.


2. An investigation conducted at Yale by the writer employed Cooperative
tests in French, German and Spanish at the college level. Data
obtained were incorporated in a confidential report to a language evaluation
committee and have not been released as yet for general publication. In the
course of the investigation, however, some question was raised with regard to
the comparability of scaled scores in these three language tests. The feeling
was expressed (in private correspondence) that the scaled-score
norms for the German test are a bit more rigorous than in the case of the
French test. This is pointed out here not to disparage the idea of scaled scores
but simply to indicate that perfection, even in their development, has not yet
been achieved. These are further discussed in Chapter IV, p. 113.

SAMPLING OF FUNCTIONS AND TASKS


Sampling also exists among the functions tested and the tasks
chosen to measure these functions. Aptitude for the study of
medicine, to take one example, is difficult to evaluate because
of its complexity. The practice and study of this and other
professions, embracing as they do various specialties, involve
a number of functions both mental and physical (e.g., memory,
scientific reasoning, research ability, personal adjustments
and relationships, visualizing power, manual dexterity, health,
etc.). It is not feasible to represent them all within the
practicable limits of a medical aptitude test. Hence only some of
those functions—the ones which lend themselves most readily
to measurement and appear clearly significant—are selected
for appraisal. Certain tasks are then devised to sample individual
proficiency in these particular functions. The tasks are in
turn simulated by appropriate qu[...]volving sources of error
extr[...] problem itself.

Fortunately, not all situations [...] prognosis are so involved,
nor h[...] admonitions should not be
regarded as unduly dour and pessimistic. It is the writer's
conviction—and presumably that of many other persons engaged
in the development of various aptitude tests—that such
instruments as will later be discussed hold great possibilities for
educational progress. What has just been said about their
limitations applies to all methods of mental testing and prognosis,
whether of the restricted-answer (objectively scored) or essay-type
(subjectively graded) examinations, although the latter
are frequently criticized on the following counts: (1) that
subjectivity of grading makes the results statistically unreliable;
(2) that the scale of marking (such as A, B, C, D or 90, 80,
70, 60, etc.) has little comparability from one examination to
another; and (3) that individual questions may be unfair,
either because of some tricky “catch” or because they presume
an esoteric knowledge which only certain students may have
acquired.

3. For a more exact and comprehensive discussion of sampling in educational
research, see Lindquist (1940a; 1940b). This problem and others fundamental
in statistical analysis are further discussed by R. A. Fisher (1938).
None but the mentally strong and brave, however, should expect fully to cope
with Fisher.


“Essay” Versus “Restricted” Answers 


This is not the place for elaborate discussion of marking and
examination systems, but it may be appropriate to mention that
methods have recently been developed to offset or eliminate
certain glaring weaknesses which formerly characterized many
essay questions. Techniques initially valuable in the construction
of “short-form” (intelligence, aptitude and achievement)
tests now serve a no less useful role in connection with
unrestricted-answer examinations. To take but one example, marked
changes effected during the past decade (in procedures and
nature of its test materials alike) by the College Entrance
Examination Board have both reflected, and significantly
contributed to, this progressive trend.

Despite such improvements, most essay questions, however
well constructed, read and uniformly scored, still suffer one
disadvantage. The time allowed for each answer must be liberal;
otherwise the inherent (and otherwise irreplaceable) power
content of “long-form” examinations would be much restricted
in scope. Yet this same situation—need for adequate time
allowance—obviously limits the number of questions which can be
asked and the consequent sampling of each student's knowledge.
This can be reasonably well accomplished (at least for the
superior candidate) by well-designed, broad questions which
provide an opportunity for him to go “all out.” No multiple-choice
or true-false item can offer that same opportunity. However
difficult or abstruse a problem may be, merely checking the
right answer does not reveal how well versed the student is on
that point—i.e., whether he has just enough acquaintance with
it to make a shrewd correct guess, or considerably more knowledge
than he is thus given a chance to demonstrate.

Both free-essay and restricted-answer examinations have
their merits and shortcomings. No clear black or white contrast
between them (e.g., as “subjective” or “objective”) is quite
realistic, because differences of this sort are relative, not
absolute. In fact, no discriminatory test item, even of the simple
true-false type (however carefully phrased or scored), can be
wholly objective in operation. Various factors enter to affect
the total response-situation. While each question comprising a
multiple-choice test may have been well selected from a “pool”
of previously validated materials,⁴ responses to some of these
pretested items, upon readministration to a new group of
subjects, may follow a different pattern from that expected. This
entirely obvious fact—that individuals differ in reaction to
the same materials—is frequently overlooked in measurement
procedures. The range of error can be greatly reduced through
objective methods of test construction and scoring; but the
problem of subject variability (i.e., that represented by the
persons examined) still remains. Hence the forbidding term
“subjective,” often used in derogation of essay questions or
academic grades, applies as well to individual reactions and
therefore indirectly to all mental testing situations.

The foregoing comments suggest that short-answer questions
are not necessarily superior to those of the essay type; the
latter, in fact, by nature may prove more searching than true-false
or multiple-choice items, which individually cover less
ground and offer little opportunity for exploration or development
of each topic represented. On the other hand, these items
do offer a much wider range of sampling the individual's
acquaintance with many topics. It is that possibility, even more
than objectivity in scoring the responses, which gives properly
constructed tests of this type a particular value. Moreover, this
wider range of sampling permits certain important statistical
techniques (scarcely possible with a much smaller number of
essay questions) to be utilized for the improvement of measurement
and guidance procedures. The topic of objective versus
subjective examinations is further discussed in Chapter IV.

4. The technique of “pretesting” items, in order to establish their characteristic
level of difficulty and discriminating power for certain groups, will be
discussed in Chapter VII.

B. FREQUENCY DISTRIBUTIONS, MEASURES OF CENTRAL
TENDENCY, AND DISPERSION


A first step in the statistical analysis of test scores, as in the
application of quantitative methods generally, is arrangement
of the data into some logical order. What form this should take
naturally depends in part upon the nature of observations,
records, scores, etc. to be studied. If discrete units (e.g., right
or wrong, male or female, white or black, etc.) are involved,
results are classified accordingly. Most “measures,” however, as
that word implies, extend over some sort of continuous scale.
As the number of intervals or steps on a scale multiplies, the
necessity for statistical methods increases, in order to bring
meaning out of complexity. Quantitative analysis thus essentially
consists of grouping, or otherwise simplifying, data
which are too extensive for either full comprehension or adequate
comparison in their original form.

Assuming that we have a large number of recorded observations
(such as test scores) scattered over a wide range of scale
intervals, the natural procedure would be first to arrange them
in order of ascending magnitude or excellence, and next to
group the scores into a smaller number of categories—e.g., 0
to 9, 10 to 19, 20 to 29, etc.—than originally occur. The number
of intervals employed should be neither so great as to prove
unwieldy nor so small as to make for unduly coarse groupings.
The scale values should also be appropriately chosen, as they
are for example in recording temperatures. The range and fineness
of reading afforded by clinical thermometers are quite different
from those utilized in the control of many industrial
processes or laboratory experiments. In psychological testing,
varied instruments, scales and measuring intervals are likewise
obviously required to serve particular aims. While some of the
scales initially contain a great many steps—such as 0 to 500 if
there are 500 questions in a given exercise—20 to 80 intervals
(depending in part upon the size of “populations” examined)
will usually suffice for purposes of analysis and tabulation.
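
The grouping procedure just described may be sketched as follows. This is a minimal illustration, assuming whole-number scores and a fixed interval width of ten; the scores and the function name `frequency_table` are hypothetical.

```python
from collections import Counter


def frequency_table(scores, width=10):
    """Count scores per class interval [low, low + width - 1]."""
    counts = Counter((s // width) * width for s in scores)
    return {(low, low + width - 1): counts[low] for low in sorted(counts)}


# Hypothetical whole-number test scores, for illustration only.
scores = [3, 12, 15, 27, 21, 29, 8, 14, 35, 22]
table = frequency_table(scores)

for (low, high), freq in table.items():
    print(f"{low:>3} to {high:<3} {freq}")
```

A wider or narrower interval is obtained by changing `width`; the choice is a matter of judgment, as noted above.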

Arrangement of such data in order of magnitude, and usually
in grouped intervals, is known as “distribution.” This may
be represented either graphically by a distribution curve, the
familiar bar-diagram method and other pictorial devices; or
numerically by a frequency table. In any case, the purpose is to
arrange and condense the original records in systematic order,
so as to indicate the relative number or proportion of cases
(frequencies) falling in successive intervals on a scale. The most
common tendency for such continuous distributions is to follow
a “bell-shaped” curve, with frequencies congregating around
the mean and becoming progressively fewer (i.e., “tailing out”)
toward each extreme. The normal probability curve, so frequently
discussed, is really a mathematical abstraction based
on the theory of an infinite number of cases and a scale of values
also extending to infinity. Fortunately the aptitude tester never
encounters data having these unusual characteristics. However,
within their own ranges of measurement, many distributions of
test scores do tend to approximate the normal curve. Hence,
for purposes of analysis it is useful to employ the quantitative
procedures and techniques developed therefrom. Standard
statistical texts usually express this curve by the equation:


$$y = \frac{N}{\sigma\sqrt{2\pi}}\, e^{-x^{2}/2\sigma^{2}}$$


It is not necessary to define here the terms of this expression.
However, the above equation can be derived from the expansion
of the familiar binomial $(x + y)^{n}$ when $n$ approaches infinity.
The assumption that people's abilities are generally thus
distributed—and therefore that measures of them also should be
—underlies the great majority of statistical procedures in
education. Sometimes, however, frequency distributions take quite
different forms; e.g., that of chronological age among college
freshmen usually “spreads” farther above than below the
average, as depicted in Figure I.


Characteristics of a Distribution 
Certain items of information customarily desirable to know
about a distribution, as quoted from Guilford's adaptation of
a list earlier compiled by Kelley (1924, p. 44), here follow:

“1. What is it a distribution of? This involves such questions
as the nature of the variable or scale; the unit of measurement
and whether it is constant; whether there is an
absolute zero point and where it is located.
“2. The number of cases or the population.
“3. Some measure of central tendency.
“4. Some measure of dispersion or variability.
“5. A measure of symmetry or lack of symmetry, i.e., skewness
[as with the freshman age example just cited].
“6. The kurtosis, or degree of ‘bunching’ at the center.
“7. Tendency toward bimodality [two ‘peaks’ in distribution
rather than one as normally occurs].” (Guilford, 1936,
pp. 105-106.)

Figure I

AGE AT MATRICULATION OF 806 FRESHMEN, CLASS OF 1942


We can hardly attempt to describe all these qualities here;
but wish merely to point out that wide variations from the
normal curve frequently occur. To overlook them when they
do, may introduce serious errors of calculation or interpretation.
As suggested above, some distributions are not “unimodal”
(i.e., with but one area of maximum frequency). They may be
bimodal (roughly like the two humps of a Bactrian camel's
back, as distinguished from a dromedary's one). This effect is
usually produced by the overlapping of two individually normal
distributions, with rather widely separated means. In
others, concentrations “pile up” on one side rather than near
the center of a scale, so that they are skewed (unsymmetrical).
Even where “central tendency” is normal, the curve may be
either peaked or flattened to an extreme degree. In any such
circumstances the use of certain statistical formulas may be to
some extent inappropriate, because these formulas themselves
are based upon assumptions of reasonable normality.


CENTRAL TENDENCIES 


Whether or not it is feasible to report the complete distribution
for a given variable, one often seeks for a single quantity
which is representative of that distribution as a whole. The
most obvious quantity so used is a simple arithmetical average
or “mean.” This is a familiar index of central tendency. In
some circumstances, however, the average may prove exceedingly
misleading, as when a few extreme cases at one end of
the scale grossly outweigh the majority. For example: “The
average income, twenty years after graduation, of the Class of
1919 is found to be $7,243.29.”⁵ After reading in the newspapers,
or elsewhere, statements of this nature concerning reported
income of college graduates, the writer has repeatedly
sought further information. Consistently, whenever a distribution
of the individually reported incomes was obtained, it became
clear that the “average” had been considerably distorted
by a few very high (atypical) returns. In such situations, the
median (mid-point on the scale, above and below which the
same number of cases lie) affords a better index of the real central
tendency than the mean does. One inquiry along these lines
actually revealed that the publicized earnings average was
nearly double the median.

5. This is an actual quotation from the report published after its vicennial
reunion by one class, of a university (not Yale) which the writer sees no reason
to identify.

To take another example, let us assume that nine men living
on Springside Boulevard have incomes ranging between $4,000
and $6,000 annually, and that Mr. Brown, with the large
house and extensive grounds at the corner, has a salary of
$50,000. What figure best represents the Boulevard residents'
“financial bracket”? Their average income would approximate
$9,500 (i.e., $95,000 ÷ 10 = $9,500), or nearly
twice what most of the neighbors, except Brown alone, earn.
The median however would be under $6,000; obviously a much
more representative index for that locality.
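
The Springside Boulevard example can be checked with a few lines of arithmetic. The nine individual incomes below are hypothetical, chosen only to lie between $4,000 and $6,000 and to total $45,000 so that the mean agrees with the figure in the text; the text itself does not list them.

```python
from statistics import mean, median

# Nine hypothetical incomes between $4,000 and $6,000 (total $45,000).
neighbors = [4000, 4250, 4500, 4750, 5000, 5250, 5500, 5750, 6000]
brown = 50000  # Mr. Brown's salary, as given in the text

incomes = neighbors + [brown]

average_income = mean(incomes)    # pulled upward by the single extreme case
median_income = median(incomes)   # midway between the 5th and 6th incomes

print(f"mean:   ${average_income:,.0f}")
print(f"median: ${median_income:,.0f}")
```

Whatever nine values within that range are assumed, the median of the ten cannot exceed $6,000, while the mean is governed largely by Brown's salary.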

Another way to control the excessive effect of a few extreme
cases upon the average is to “truncate” the distribution—i.e.,
to fix its upper and lower limits arbitrarily (such as, for incomes,
at termini of “$25,000 and up” and “under $500”).
Thus the few outlying instances are brought within a range
appropriate to the majority. This is common practice in compiling
frequency data of all sorts; but even then, a median rather
than a mean will sometimes best represent the central tendency
characteristic of a tabulation. Still another measure of this
tendency is the mode, or “peak” of greatest magnitude, in a
distribution. When the latter is perfectly normal, mode, mean
and median coincide.


VARIABILITY OR DISPERSION


No less generally important than the measure of central
tendency is that of variability within a distribution—the extent
of its spread on either side of the mean (or median). The
more closely individual measures are packed into a narrow
range, the more accurately representative of them all a central
tendency becomes. As dispersion increases, the chance that
any particular score may differ widely from the average is
obviously enhanced. Therefore, an index of spread, or range, is
essential to proper understanding of the situation as a whole.

The index now usually employed as a measure of spread is
the “standard deviation” (SD) often designated by the symbol
σ (sigma). This has such extensive and important utility for
psychometrics, in many ways, that any attempt to discuss in

Figure II


SCALE VALUES IN A NORMAL DISTRIBUTION 


* On scales such as are now often employed, with a 
mean of 500 and a standard deviation of 100, a score of 
600 ranks the individual at the 84th percentile—i.e., 34 
percentile points above the mean; and a score of 700, 
approximately within the top 2%. Correspondingly, 400 
—minus 1 sigma—falls at the 16th percentile; and 300— 
minus 2 sigma—at approximately percentile 2, i.e., barely
above the two lowest ranks per hundred. The proportion
of cases respectively scoring above 750 or below
250 on such scales is less than 1%. Frequently this scale
is for convenience reduced to one with a mean of 50 and a
standard deviation of 10. For a more complete description
of this scoring method, the reader is referred to
Learned and Wood (1938, p. 390) or to statistical texts
previously cited. Application of this sigma scale to
marking procedures and reporting of test results is discussed
in Chapter VII (p. 227).


brief compass its derivation, characteristics or manifold applications
is quite impracticable. Mathematicians will recognize
it as denoting the “point of inflection” on the normal-probability
curve. The standard deviation is as fundamental in statistics
and mental measurement generally as, for example,
Boyle's Law in the physical sciences or Grimm's Law in phonetics.
It is also known as the root mean square deviation: i.e.,
“the square root of the mean of the squared deviations,” most
simply expressed by the formula

$$\sigma = \sqrt{\frac{\sum x^{2}}{N}}$$

in which x is the deviation of a score from the mean and N the
number of cases.


One standard deviation from the mean embraces, in either
direction, slightly more than one-third of the cases in a normal
distribution. With the mean as center or point of reference,
about 68% lie within the range of ± 1 sigma. Since frequencies
beyond that point diminish with progressive rapidity, about
95% are included within the span of ± 2 sigma; and only ¼
of 1% altogether (.0013 at each extreme) fall outside the ± 3
sigma range. Consequently, six standard deviations—three on
either side of the mean—may be regarded for most purposes
as fully comprehensive, practical limits of range for any total
distribution which approximates normality.
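
Both the root-mean-square definition and the proportions just quoted can be verified numerically. The sketch below assumes a small set of hypothetical scores and relies on Python's standard `statistics` module; it is offered as a check on the arithmetic, not as part of the original exposition.

```python
import math
from statistics import NormalDist, pstdev

# Hypothetical scores, chosen so that the arithmetic is easy to follow.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

# Root mean square deviation computed directly from the definition.
m = sum(scores) / len(scores)
sd_by_hand = math.sqrt(sum((x - m) ** 2 for x in scores) / len(scores))

# The same quantity from the standard library (population form).
sd_library = pstdev(scores)

unit_normal = NormalDist()  # mean 0, standard deviation 1


def within(k: float) -> float:
    """Proportion of a normal distribution lying within +/- k sigma."""
    return unit_normal.cdf(k) - unit_normal.cdf(-k)


print(f"SD by hand: {sd_by_hand}, SD from library: {sd_library}")
for k in (1, 2, 3):
    print(f"within +/-{k} sigma: {within(k):.4f}")
```

The proportion beyond three sigma at each extreme, `1 - unit_normal.cdf(3)`, comes to about .0013, in agreement with the figure given above.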


Scale Values in a Normal Distribution 


Figure II depicts a normal distribution or probability curve
(within the ± 3 σ limits) in relation to the mean, sigma (standard
deviation) and percentile values. As the term suggests, percentiles
denote relative standing throughout a scale of 100 intervals.
Accordingly percentile ranks indicate the percentage
of cases in a particular group lying at or below successive points
on the scale. Any percentile is a theoretical point: it has no
width and therefore occupies no space on the scale.

Percentiles are sometimes designated by the abbreviation P,
with a subscript; e.g., P₁₅ for the 15th percentile point. The
twenty-fifth and seventy-fifth percentiles, however, usually have
the respective notations Q₁ and Q₃ (first and third quartiles).
Analogous other terms, such as decile, quintile, septile, etc. are
self-explanatory. The “interquartile range” embraces that half
of the total distribution which lies on either side of the median
(i.e., between percentiles 25 and 75 of a normal distribution).
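
The correspondence between sigma-scale scores and percentile ranks can be sketched for the scale mentioned in the note to Figure II (mean 500, standard deviation 100). The sketch assumes an exactly normal distribution, which actual test scores only approximate.

```python
from statistics import NormalDist

# Scale with a mean of 500 and a standard deviation of 100,
# assumed here to be exactly normal.
scale = NormalDist(mu=500, sigma=100)


def percentile_rank(score: float) -> float:
    """Percentage of cases lying at or below the given score."""
    return 100 * scale.cdf(score)


for score in (300, 400, 500, 600, 700):
    print(f"score {score}: percentile {percentile_rank(score):.1f}")

# First and third quartile points (percentiles 25 and 75) on this scale.
q1 = scale.inv_cdf(0.25)
q3 = scale.inv_cdf(0.75)
print(f"Q1 = {q1:.1f}, Q3 = {q3:.1f}")
```

On such a scale the interquartile range spans only about 135 score points, a reminder that percentile points crowd together near the mean.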
It will be noted by reference to the foregoing graph that percentile points tend to “pack in” at the mean and spread out toward the extremes. That, of course, is because the largest number of cases congregate centrally and each percentile theoretically represents the same proportion (1%) within a total group. In actual practice this is not always true because the grading intervals employed, or number of cases, may be either relatively few or somewhat irregularly distributed. Thus grades obtained from a letter-marking system (such as A, B, C, D, F) cannot yield enough steps on the scale to spread over the 100 percentiles which comprise a theoretical distribution. As a result “jumps” of a good many percentile points may occur from one grade to the next. In those circumstances, percentile points are sometimes calculated at the centers of their respective intervals. This ensures that the percentile so calculated best represents all cases in each particular group. An example of this is the percentile equivalent of grades presented in the accompanying hypothetical Table 1.


TABLE 1


Distribution of Grades and Corresponding Percentiles in One
Typical College Freshman Course


                           Percentile Equivalent     Percentile Range
         No. Receiving     Calculated for Center     of Each
Grade    Each Grade        of Each Grade Interval    Grade Interval

A             37                   96                   92-99.9
B            123                   77                   63-91
C            177                   41                   21-62
D             70                   12                    5-20
F             15                    2                  0.1-4
             ---
             422


Such gross categories obviously restrict discrimination. No difference is noted, for example, among 123 students (more than a quarter of the total number) receiving the grade B. Their percentile standing embraces the wide interval from 63 to 91, but which of them really deserves to rank around percentile 90 (just short of an A grade) and which around 65 (just above C) cannot be determined from these data. Obviously, since all have the same B grade, they should all be ranked at the same percentile point; i.e., 77 in this instance.

The distribution shown in Table 1 is for the grades assigned to 422 students in a single course. Suppose that each of these students is taking four other courses; when all his grades are combined a quite different situation results. In this case, for purposes of averaging letter grades, numerical values 4, 3, 2, 1 and 0 are respectively assigned to A, B, C, D and F. Table 2 shows the distribution of these averages and the corresponding percentiles, now more widely spread because the averaging process has increased the number of intervals. However, percentile “steps” are still relatively greater near the mean than at the extremes, because more cases are “near average” than widely divergent. It is important, therefore, to realize that the significance of differences in percentile standing varies throughout any normal distribution. Thus a relatively small gain in middling scholastic performance, as represented in Table 2 from 2.4 to 2.6 (.2 net or the equivalent of one letter grade difference in a single course) raises a student from the 44th to the 58th percentile of his class, since frequencies are largest around the mean. By comparison, progress from percentile 85 to 99 represents a much greater advance in terms of significant accomplishment, i.e., as here illustrated from 5 B’s (3.0) to at least 3 A’s and 2 B’s (3.6). Moreover, anyone familiar with college marking procedures will realize that the intended difference between B and A in this crude sort of grading is usually much greater than it is between C and B.
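The grade averages cited above follow directly from the numerical values 4, 3, 2, 1 and 0. A brief sketch in Python may make the arithmetic concrete; the names `GRADE_POINTS` and `grade_average` are illustrative only, and the five-course programs shown are hypothetical examples consistent with the text.

```python
# Numerical values assigned to letter grades, as stated in the text.
GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

def grade_average(letters: str) -> float:
    """Average the numerical equivalents of a student's letter grades."""
    points = [GRADE_POINTS[g] for g in letters]
    return sum(points) / len(points)

print(grade_average("BBBBB"))  # five B's
print(grade_average("AAABB"))  # three A's and two B's
print(grade_average("BBCCC"))  # two B's and three C's
print(grade_average("BBBCC"))  # one C raised to a B in a single course
```

Raising one grade by a single letter in one of five courses changes the average by .2, which is the step from 2.4 to 2.6 discussed in the text.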


Percentiles and Standard Deviations

The inequality of percentile differences makes it improper to average such ranks. Hull (1928, p. 162) states: “The result of this distortion is that percentile values cannot be treated like the units of an ordinary scale. In percentiles, two added to two does not necessarily equal four.” And later, “It is thus perfectly evident that ranks must not be treated like ordinary units. . . . They should never be averaged, for example, except where the very roughest approximations are desired. For substantially the same reason the Pearson product-moment correlation coefficient . . . cannot be computed directly from rank scores.” (Idem, p. 385.)

Assuming the original units of measurement to be of equal significance and the distribution normal, then sigma differences are of equivalent import wherever they occur, i.e., whether between zero and .5 or between 2.5 and 3.0 standard deviations from the mean. This statement does not mean that individual deviation ranks are necessarily of equal subjective importance throughout any range, but simply that the units of measurement (sigma values) are directly comparable throughout.
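Expressing scores in sigma units amounts to subtracting the mean and dividing by the standard deviation. The sketch below, in Python, is offered only as an illustration: the function name `sigma_scores` and both sets of figures are hypothetical, and sigma is taken as the root mean square deviation (dividing by N).

```python
import statistics

def sigma_scores(values):
    """Express each value as a deviation from the mean in standard-deviation units."""
    mean = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # root mean square deviation, dividing by N
    return [(v - mean) / sigma for v in values]

# Hypothetical records for the same five students on two unlike scales.
test_scores = [40, 50, 55, 60, 70]
heights_in_inches = [64, 66, 68, 70, 72]

for raw, z in zip(test_scores, sigma_scores(test_scores)):
    print(f"test score {raw}: {z:+.2f} sigma")
for raw, z in zip(heights_in_inches, sigma_scores(heights_in_inches)):
    print(f"height {raw} in.: {z:+.2f} sigma")
```

Once both variables are on the sigma scale, a given difference (say half a sigma) has the same meaning on either one, which is not true of a given difference in percentile rank.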
TABLE 2


Percentile Equivalents of Hypothetical Grade Averages for All
Courses Taken by the Freshmen Represented in Table 1


Numerical     Number Receiving     Percentile
Average       This Average         Equivalent*

  3.8                1                100
  3.7                2                100
  3.6                3                 99
  3.5                5                 98
  3.4                6                 97
  3.3                8                 95
  3.2               12                 93
  3.1               15                 89
  3.0               23                 85
  2.9               25                 79
  2.8               28                 73
  2.7               31                 66
  2.6               34                 58
  2.5               29                 51
  2.4               25                 44
  2.3               23                 39
  2.2               16                 34
  2.1               11                 31
  2.0               21                 27
  1.9               17                 23
  1.8               15                 19
  1.7               14                 15
  1.6               14                 12
  1.5               12                  9
  1.4                9                  7
  1.3                6                  5
  1.2                8                  3
  1.1                5                  2
  1.0                3                  1
  0.9                1                  1
                   ---
                   422


* The percentile equivalents shown in Table 2 have been rounded off to the nearest whole number except for the lowest, which has been rounded to 1 so that there will be 100 steps (i.e., 1 to 100 inclusive) represented. The procedure adopted by the College Entrance Examination Board for its own use and for that of the Navy differs slightly from the above. Navy (V-12) achievement test scores are reported on a percentile scale on which the lowest value is 00 and the highest is 99. This is achieved either by rounding off to the lower limit of each interval or by calculating percentile points at the lower limit rather than the center of each interval. In either case, the large number of students and the fineness of the scale of original grades makes this procedure possible without distortion. It also provides 100 percentile steps, each of which is interpreted as indicating the percentage of students scoring lower than the reported percentile value.

Moreover, standard deviation scores are valuable not only for expressing the relative performance of individuals (or groups) on any single scale, but also in facilitating comparisons among different indices. For any group such disparate variables as test performance, rank in high school class, examination grades, college marks (even height, age or weight), however differently measured, can all be uniformly expressed in terms of equivalent standard deviations from the mean.

Other methods as well (e.g., interquartile range or percentile rank) have been devised for converting originally noncomparable ratings into points on a common scale; but the most satisfactory method for this purpose is recourse to the sigma score, using the average of each distribution as a basic reference point and expressing deviations therefrom in units of its own standard deviation. When the separate distributions are approximately normal, the sigma-unit differences represent, so to speak, interchangeable parts throughout an entire scale or even battery of various tests, for identical or closely similar populations.


Ogive or Cumulative-Frequency Curve

It is often useful to show a distribution of data in the form of an “ogive” curve. An ogive is simply a curve of cumulated frequencies. The term as used in present-day measurement work seems to have been proposed first by Galton (1875, p. 85), who explained that architects so designated the Gothic “pointed arch,” a pair of such curves with one reversed.

Figure III shows two ogives. Both are based on the grade averages of the 422 students represented in Table 3. Curve A shows the number of students who have grades which equal or exceed a given average. Conversely, curve B shows how many students have averages lower than or equal to a particular one.

The ogive is frequently used to show the relation of actual grades to percentiles. This is accomplished by placing a scale of percentiles on the same axis as that used to represent the number of students. The percentile value for any particular grade average can then be read directly from the curve.
TABLE 3

Cumulative Distribution of Grade Averages for Data
in Table 2

              Number Receiving     Number Receiving
Numerical     This or a            This or a
Average       Lower Average        Higher Average

  3.8              422                   1
  3.7              421                   3
  3.6              419                   6
  3.5              416                  11
  3.4              411                  17
  3.3              405                  25
  3.2              397                  37
  3.1              385                  52
  3.0              370                  75
  2.9              347                 100
  2.8              322                 128
  2.7              294                 159
  2.6              263                 193
  2.5              229                 222
  2.4              200                 247
  2.3              175                 270
  2.2              152                 286
  2.1              136                 297
  2.0              125                 318
  1.9              104                 335
  1.8               87                 350
  1.7               72                 364
  1.6               58                 378
  1.5               44                 390
  1.4               32                 399
  1.3               23                 405
  1.2               17                 413
  1.1                9                 418
  1.0                4                 421
  0.9                1                 422
[Figure III. Ogives based on the cumulative distribution in Table 3. One curve (moving in the reverse direction) shows the number of students having a given or higher average.]
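Both columns of a cumulative table, and the percentile equivalents calculated at the center of each interval, follow from the simple frequencies alone. The Python sketch below illustrates the procedure; the names `AVERAGES`, `COUNTS` and `cumulative_table` are chosen here for illustration, and the frequencies are those of Table 2 as read here (they total 422).

```python
# Grade averages from 0.9 to 3.8 and the number of students receiving each,
# listed from lowest to highest (frequencies of Table 2 as read here).
AVERAGES = [tenths / 10 for tenths in range(9, 39)]
COUNTS = [1, 3, 5, 8, 6, 9, 12, 14, 14, 15, 17, 21, 11, 16, 23,
          25, 29, 34, 31, 28, 25, 23, 15, 12, 8, 6, 5, 3, 2, 1]

def cumulative_table(counts):
    """Return (at_or_below, at_or_above, midpoint_percentile) for each interval."""
    total = sum(counts)
    rows = []
    below = 0  # cases lying strictly below the current interval
    for n in counts:
        at_or_below = below + n
        at_or_above = total - below
        midpoint_percentile = 100.0 * (below + n / 2) / total
        rows.append((at_or_below, at_or_above, midpoint_percentile))
        below = at_or_below
    return rows

for average, (low, high, pct) in zip(AVERAGES, cumulative_table(COUNTS)):
    print(f"{average:.1f}  {low:4d}  {high:4d}  {round(pct):3d}")
```

The first two columns printed are the two ogives in tabular form; the third gives the midpoint percentile, which for averages of 2.4 and 2.6 comes to 44 and 58 respectively.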
C. ERRORS OF MEASUREMENT


The calculation and use of sigma is also fundamental in the 
estimation of measurement errors, from which no form of ob- 
servation is free. Even precision instruments of the highest 
refinement have a tolerance or accepted range of error, be it
only .0001 of an inch. By comparison with such standards of
accuracy, mental measures are but coarse approximations at 
best. The errors to which they are subject may be of several 
kinds, occasioned by: (1) shortcomings within the measuring 
devices themselves; (2) fluctuations from time to time in the 
“real” ability or performance of persons tested; (3) observa- 
tional fluctuations, arising from insufficient experimental or ad- 
ministrative controls; (4) inadequate sampling, as earlier dis- 
cussed. Usually all these shortcomings enter more or less into a 
mental testing situation. It is therefore important to determine,
so far as possible, the probable limits of inaccuracy (in a sense,
the range of tolerance) for which due allowance should be made
in each category.

Variability in Marking

A grave additional source of error in most educational situations is that of unreliable criteria occasioned by the personal nature of judgments in rating, marking, etc. Classroom and examination grades are so affected by subjective variations in standards of performance as often to impair their criterion values. Objective tests, elsewhere discussed, are intended to reduce this subjectivity in grading through the use of restricted-answer material, so that the results, however scored or by whom, will at least agree. Yet decided limitations, if not other errors of measurement as well, are introduced through such restrictions upon the content or form of questions asked. Mere objectivity in scoring such items does not necessarily offset or eliminate subjectivity in their construction.


5a. Even this fine range of tolerance in measurement may now seem obsolescent; at least it has been almost incredibly surpassed. Viz: “The running clearance, or difference in diameters, of the barrel and plunger in the General Motors Series 71 Diesel engine unit injector is between 40 and 60 millionths of an inch. This allows a dimensional variation of only 20 millionths of an inch. Such precise workmanship would have caused the famed craftsmen of the [illegible] Century to stare with amazement.” (General Motors, 194?, p. 58.)


6. Cf. discussion of grade-reliabilities in Chapter IV (pp. 132-133). 
STANDARD AND PROBABLE ERRORS


We now return from speculation to statistics. In order to estimate how well a test succeeds in fulfilling its intended purpose (i.e., the range of Type 1 error mentioned on page 42) we need also to determine the degree of variance in observed results which may be attributable to other measurement errors. One important contribution made by statistical methods to this end is the calculation of how closely a recorded score approximates the “true” value which would be obtained under ideal conditions and (theoretically) an infinite number of cases. The average of ten records, obtained by the same procedure (be it an aptitude test, a spelling examination, the reading of a thermometer or spectroscopic analysis of metal) is probably more reliable than any one of them alone. Actually one or more may be “right on the beam,” but that possibility can be demonstrated only through some means of distributing all associated records. The degree of confidence in the average finding is enhanced or diminished by the extent to which the successive readings or scores correspond. As dispersion among recorded data decreases, the probability that their average represents a true value increases. The familiar terms “probable error” and “standard error” simply indicate that a given average, difference, relationship, etc. is accurate within such limits of chance as they respectively define.
The basic concept determining “chance” in this respect again is derived from the properties, illustrated above, of a normal distribution curve. If 1,000 even fairly exact measurements were recorded under similar conditions by some one device, the results would not all precisely agree, because of observational errors (Type 3 above). The resulting 1,000 separate measures theoretically should produce a bell-shaped distribution around their mean, which in turn would represent the most “probably true” value. But even bells have different shapes, more or less broad at the base in relation to their height. In this analogy, breadth corresponds to dispersion. The less this is (i.e., the more closely our 1,000 hypothetical records congregate rather than spread out) the more precise those observations must have been, and the closer limits we can therefore place upon the extent of their range. The normal curve has a range somewhat between these two extremes.

In the foregoing illustrative situation representing one thousand measurements, a distribution resulted which conformed approximately to the normal curve. The measure commonly used to describe the degree of spread in a distribution of actual data is the standard deviation or sigma (σ) earlier discussed. Some writers have employed a derivative of this measure, called the probable error (PE = .6745σ), but the modern trend is toward using σ. Standard error (SE) is a term which should not be confused with the standard deviation. The SE is analogous to σ but refers to the dispersion to be expected in a hypothetical sampling distribution, whereas σ is a measure of the dispersion in a distribution of actual data. Similarly a so-called standard error of the mean (SE_M or σ_M) can be calculated to provide an estimate of the probable variability in the means of successive samples if these were to be drawn by chance from a larger population itself normally distributed. All of these terms (σ, SE, or SE_M) have similar characteristics, i.e., each refers to the limits within which certain proportions of cases will presumably be found if the distributions in question conform to the normal curve.
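A short sketch in Python may help to fix the distinction between these measures. It is offered under stated assumptions: the passage does not print a formula for the standard error of the mean, so the usual textbook relation, sigma divided by the square root of N, is assumed here; the function names and the four readings are hypothetical.

```python
import math
import statistics

def standard_error_of_mean(values):
    """Assumed textbook relation: sigma_M = sigma / sqrt(N)."""
    sigma = statistics.pstdev(values)  # root mean square deviation of the actual data
    return sigma / math.sqrt(len(values))

def probable_error(sigma):
    """PE = .6745 sigma, the relation quoted in the text."""
    return 0.6745 * sigma

readings = [48, 50, 52, 50]  # hypothetical repeated measurements
sigma = statistics.pstdev(readings)
se_m = standard_error_of_mean(readings)
print(f"sigma = {sigma:.3f}, SE of mean = {se_m:.3f}, PE of mean = {probable_error(se_m):.3f}")
```

The standard deviation describes the spread of the recorded data themselves; the standard error of the mean, being smaller by the factor √N, describes how far the average of such a sample might be expected to wander.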


Significance of Observed Differences and the Critical Ratio 


When the results of various measurements are compared, either relative agreement or relative difference of one sort or another among a series of observations naturally constitutes the basis for such comparison. The “critical ratio” (CR) provides a simple means of ascertaining whether a difference is valid, in the sense of representing a “true” distinction, rather than one possibly arising out of mere sampling errors. “True,” in mental measurements especially, is itself a relative term, expressing probability of recurrence within defined limits.

To estimate the chances that an observed difference is significantly great, we first calculate the standard error of whatever indices are being compared (e.g., averages, sigmas, correlations, etc.); next, that of the differences between them. To illustrate, suppose we have two means, such as the average for boys and for girls, within a given academic population, on a mathematical test. How significant or trustworthy is a finding that boys as a group exceed the girls on this test?
First the standard (or probable) error of each mean is determined by methods described above; that of the difference from respective formulas:

$$\sigma_{(M_1 - M_2)} = \sqrt{\sigma_{M_1}^{2} + \sigma_{M_2}^{2}}\,; \qquad PE_{(M_1 - M_2)} = \sqrt{PE_{M_1}^{2} + PE_{M_2}^{2}}$$

Expressed verbally, for the sake of those who abhor formulas, this simply means that the standard error of a difference between two averages is the square root of the sum of their own respectively squared standard errors.

As indicated earlier, the sigma of each mean determines the “probability limits” within which it would range if calculated over and over again from successive samples within the same population. If an obtained mean is 50 and its SD 5, the chances are roughly 2 to 1 that successive values under the same conditions would fall between the limits of 45 and 55 (± 1σ) and nearly 20 to 1 that they would not range under 40 or over 60 (± 2σ). Similarly, the probability-limits for a difference can be calculated in terms of its own standard error (which, as indicated above, is derived from the individual standard errors of whatever measures yielded that difference).

Determination thereafter of the critical ratio (CR) is simple, as it consists merely of the difference divided by its own standard or probable error, thus:

$$CR = \frac{M_1 - M_2}{\sigma_{(M_1 - M_2)}} \quad \text{or} \quad \frac{M_1 - M_2}{PE_{(M_1 - M_2)}}$$

When this ratio is at least two and a half to three (based upon standard error) or four to five (based upon probable error), it is under normal circumstances deemed “highly significant.” A difference in the same direction, between like series of measurements, would then be virtually certain to recur in successive random samplings taken in the same manner from the same population.
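The two steps described above (the standard error of the difference, then the ratio) can be sketched in a few lines of Python. The function names and the figures for the two groups are hypothetical, chosen only so that the arithmetic is easy to follow; the calculation assumes the two means are independent.

```python
import math

def se_of_difference(se_a: float, se_b: float) -> float:
    """Square root of the sum of the two squared standard errors."""
    return math.sqrt(se_a ** 2 + se_b ** 2)

def critical_ratio(mean_a: float, se_a: float, mean_b: float, se_b: float) -> float:
    """Difference between two means divided by its own standard error."""
    return (mean_a - mean_b) / se_of_difference(se_a, se_b)

# Hypothetical example: boys average 52 (SE 1.2), girls average 49 (SE 0.9).
print(f"SE of difference: {se_of_difference(1.2, 0.9):.2f}")
print(f"critical ratio:   {critical_ratio(52, 1.2, 49, 0.9):.2f}")
```

With these illustrative figures the difference of 3 points has a standard error of 1.5, giving a critical ratio of 2.0, which falls short of the two-and-a-half-to-three standard mentioned above.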

7. This formula is strictly applicable only to the situation in which the variables compared are uncorrelated. Appropriate treatment of this topic will be found in various statistical texts herein cited.

CR is a valuable index, widely used with reference both to agreement among some observations and to contrast among others. Consider a hypothetical example of recorded average difference between college men and women on a mathematical test. As a check on this obtained difference, their ages and general intelligence ratings may well be studied to determine whether the two groups are reasonably equal in general (as distinguished from mathematical) ability. Consequently their respective means in performance along those lines are compared as a “control” upon possible variances other than that being particularly investigated. If critical ratios of such differences are small (e.g., all under 1.0), they are not demonstrably significant, since they do not exceed the limits of sampling error. To illustrate these possibilities, the following table is presented.


TABLE 4


Hypothetical Comparison Between 100 College Men and
100 College Women in Freshman Year


                                                                           Probability that
                                  Mean or   Standard Error   Critical           the CRσ Indicates
Measures Compared                 Average   of the Mean      Ratio              “Statistical
                                  (M)       (σ_M)            (CRσ)              Significance”

Chronological Age        Men      222       7.0              2/9.2 = 0.21       58/100
(in months):             Women    220       6.0
General Intelligence     Men       51       0.8              -2/1.06 = -1.88    97/100
Test Scores:             Women     53       0.7
Mathematical Aptitude    Men       54       0.7              4/0.99 = 4.04      999/1000
Test Scores:             Women     50       0.7


As has been said before, the “significance” of any statistic is judged in terms of the degree to which a difference of equal magnitude might occur by chance alone. Standards of significance have been developed from pure mathematics and verified by experimental sampling. Dice and pennies are the favorite media in such experiments. If the statistic (whether it be a correlation coefficient or the difference between two averages) is no greater than might be produced by chance four times out of ten its importance is clearly dubious. If the probabilities of its occurring by chance are but four times in one hundred, or one thousand, significance thereof is proportionately enhanced. One major purpose served by statistical analysis is to determine such odds.

The hypothetical data in Table 4 are presented for illustrative purposes. For example, in chronological age the two groups are nearly enough alike to have been drawn at random from a larger population. By contrast, their respective average scores on the mathematical aptitude test reveal quite another picture. Chance sampling errors alone could not account for such a large difference as that represented. Men seem definitely superior to women in mathematical aptitude, as measured in this hypothetical situation. On the other hand, evidence in respect to their general intelligence test scores is less clear-cut. The difference in favor of women students might be accounted for by sampling errors, but the odds are 33 to 1 (97 chances out of 100) against that possibility. Such odds are high, but not so conclusive as extreme scientific rigor at times may demand.
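The critical ratios of Table 4, and the chances attached to them, can be recomputed from the means and standard errors. The Python sketch below is an illustration only. It assumes that the probability column is the area of the normal curve lying below the obtained ratio (a one-sided reading that approximately reproduces the 58/100 and 97/100 entries); the names `TABLE_4`, `normal_cdf` and `odds_against_chance` are chosen here and do not come from the text.

```python
import math

def normal_cdf(z: float) -> float:
    """Area of the normal curve lying below z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def critical_ratio(m1, se1, m2, se2):
    """Difference between two means divided by the standard error of that difference."""
    return (m1 - m2) / math.sqrt(se1 ** 2 + se2 ** 2)

def odds_against_chance(cr: float) -> float:
    """Assumed one-sided reading: chances the difference is 'true' versus a sampling accident."""
    p = normal_cdf(abs(cr))
    return p / (1.0 - p)

# (men's mean, its SE, women's mean, its SE), as given in Table 4.
TABLE_4 = {
    "chronological age": (222, 7.0, 220, 6.0),
    "general intelligence": (51, 0.8, 53, 0.7),
    "mathematical aptitude": (54, 0.7, 50, 0.7),
}

for measure, row in TABLE_4.items():
    cr = critical_ratio(*row)
    print(f"{measure}: CR = {cr:+.2f}, chance = {normal_cdf(abs(cr)):.3f}, "
          f"odds about {odds_against_chance(cr):.0f} to 1")
```

Small differences from the printed table are to be expected, since the table rounds the standard error of each difference before dividing.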


INTERPRETATION OF CRITICAL RATIOS 

The critical ratio serves quite satisfactorily to demonstrate that certain differences are not significant, and in a more limited way that certain others are. By “a limited way” we mean that relative degrees of positive determination do exist, whatever critical point has been arbitrarily chosen as the minimum level of significance. Negation thereof, as indicated by a low CR, is reasonably conclusive and nuances of insignificance unimportant. As just indicated, relative degrees of significance, even below the accepted critical level (i.e., CRσ less than 3.0), may well prove important.

Probabilities of whatever sort progress through a “spread” or range and in most instances of mental measurement refer to a continuous (rather than a discrete) series of original data. To draw a line at some point on the scale as a fixed “pass mark” (whether 60 for school grades or 3.0 for the critical ratio) is a convenient simplification, perhaps justified by practical necessities but nevertheless arbitrary. In evaluating observed differences, we may wish to know more than merely whether they are acceptable in a “yes-or-no” sense. A CRσ of 2.17, for example, falls below the accepted standard of indubitable significance; yet the chances are theoretically 985 to 15 that a difference with that critical ratio is valid (Guilford, 1936, pp. 60–61). It seems unreasonable that fifteen chances out of a thousand should disqualify this difference as proven, according to rigid statistical canons. While these should certainly be observed in respect to sweeping generalities, it is important to realize that somewhat lower critical ratios than give virtual assurance still lay heavy odds as to the probability that an “observed” difference is “real.” To quote from the interesting text by Peters and Van Voorhis:

“Finally we [would] again protest the magic that is involved, chiefly for laymen in statistics, in a ratio of just 3 between a difference and its standard error. This is completely arbitrary. Several other equally arbitrary ratios have been suggested. . . . If one looks through the experimental literature in these fields, he will find that by these standards the vast majority of experiments turn out to show differences that are not ‘statistically significant.’ That would be harmless enough if such outcome were not so frequently misinterpreted. . . . We should like to bet on the stock market with the odds 100 to 1 in our favor, or even 5 to 1; and in the same spirit we are willing to consider with more favor than we accord to its rival a procedure that an experiment indicates to be superior by odds of much less than the 740 to 1 that a [critical] ratio of 3 indicates, while we await more conclusive evidence.” (Peters and Van Voorhis, 1940, p. 476.)


“Confidence Levels” 


Partly for such reasons, another means of evaluating the relative significance of various differences has gained favor in recent years. Although based upon the same fundamental principles of normal probability distribution, this method leads to more precise and meaningful definitions than are immediately afforded by standard errors, or critical ratios, as such. The new procedure, so to speak, translates the latter into direct expression of the chances (as 1 or 5 out of 100) by which any obtained index can be described as valid.

The expression, “This difference (average or correlation coefficient, etc.) is significant at the 5% confidence level,” means that there are only 5 statistical chances out of 100 that it may not be a true determination. Likewise the 1% confidence level indicates only 1 chance in 100 that it may be a freak of sampling errors.
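As a rough illustration of how a critical ratio might be translated into such a level, the following Python sketch may be helpful. It rests on an assumption not stated in the text: the chance of a “freak of sampling errors” is taken as the two-sided area of the normal curve beyond the obtained ratio, which is the convention commonly used for the 5% and 1% levels. The function names are hypothetical.

```python
import math

def chance_of_sampling_freak(cr: float) -> float:
    """Two-sided normal probability of a ratio at least this large in either direction."""
    return math.erfc(abs(cr) / math.sqrt(2.0))

def confidence_level(cr: float):
    """Return '1%' or '5%' if the ratio reaches that level, otherwise None."""
    p = chance_of_sampling_freak(cr)
    if p <= 0.01:
        return "1%"
    if p <= 0.05:
        return "5%"
    return None

for cr in (1.0, 2.0, 2.6, 3.0):
    print(f"CR = {cr}: chance = {chance_of_sampling_freak(cr):.4f}, level = {confidence_level(cr)}")
```

On this two-sided reading a ratio of about 2 reaches the 5% level and one of about 2.6 the 1% level; a one-sided reading, such as appears to underlie the odds quoted earlier in this section, would give somewhat more favorable chances for the same ratio.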

The advantages of expressing significance, especially of marked differences, in confidence level (probability) terms, rather than on a dichotomous (wholly or not at all valid) basis, will not be elaborated further at this point. Citations introduced subsequently concerning various tests, if they include references at all to the statistical significance of correlations, differences or critical ratios, generally do so in the earlier terms of probable or standard errors, as originally reported. For that reason, the method of confidence levels can be used in later chapters to only a limited degree.


D. VARIOUS ASPECTS OF CORRELATION 


In his memorable Hereditary Genius and many other works, Sir Francis Galton (1870) proved himself no less a genius than his cousin, Charles Darwin. The concept of “co-relation” as first enunciated by Galton (a tendency for measures to correspond with each other in varying degrees) still underlies the modern term, correlation. Many years of research have improved the technical procedures, especially with reference to application in particular situations (educational, [illegible], economic et al.), but the basic principle stands. The degree of association between educational (or analogous) indices is expressed by a coefficient now, unless otherwise specified, customarily the “Pearson r”:

$$r = \frac{\sum xy}{n\,\sigma_x \sigma_y}$$

Its interpretation and basis presuppose reasonably normal distributions of both variables (x and y), and rectilinear or straight-line relationship between them. Sometimes these conditions are not even approximately met: e.g., the number of cases is small, either or both x and y values are badly skewed, or the plotting of one against the other produces a distinctly curvilinear relationship (like an arc, or perhaps a distinctly [illegible] curve). Under those circumstances, to a greater or lesser extent as is discussed in statistical texts, the Pearson “product-moment” method may not in a strict sense apply. Other methods have been devised to make at least theoretical allowance for such vagaries of “abnormal” distribution and permit expression of relationship by other coefficients and symbols. For example, η (eta), the “correlation ratio,” may be utilized when the relationship is curvilinear to a degree which makes the Pearson r formula inappropriate. The “rank-order” coefficient (ρ, or rho) offers a short-cut method of computation, sometimes used to determine whether the data warrant more elaborate analysis. Introduction during recent years of various labor-saving charts and mechanical aids has so greatly facilitated the calculation of Pearson coefficients, however, that “rho” is now seldom employed except when the number of cases involved is quite small (less than 20 or 30, depending somewhat upon normality of their distribution).

8. For the means of calculating this important index and other correlation coefficients, see any standard text on statistical methods, such as those earlier cited in this chapter. In the formula above, Σxy represents the sum of “cross products” (x·y), n the number of cases and σx, σy the respective standard deviation of each variable.

Other correlation devices are: C, the coefficient of contingency, utilized when the data do not permit of classification into a true, continuous series, but are grouped into perhaps 3 to 5 steps on each axis; “tetrachoric r,” reflecting a mere 2 × 2 classification or two-way dichotomy (as, high or low on test scores versus passed or failed in course grades); r_bis, the biserial correlation, serving when one of the variables, such as total score on a test, extends over a continuous scale while the other can be expressed in but two classifications, e.g., “right” or “wrong” answers to each question. This method is especially valuable in item analyses (see pp. 228 ff. of Chapter VII). The coefficient of multiple correlation, R, an extremely important index, is discussed later in this chapter under the heading “Academic Prediction.” No attempt has been made to illustrate graphically a correlation surface or even to treat the important concept of regression lines. To do so would require considerable expansion of this chapter. The interested reader will find adequate presentation of these matters in Garrett (1941), Guilford (1936 and 1942), and Walker (1943).
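For readers who wish to see the two commonest coefficients worked out, a minimal Python sketch follows. It is an illustration under stated assumptions: the product-moment formula is taken in its usual form (sum of cross products of deviations, divided by n times the two standard deviations); the rank-order formula used, 1 − 6Σd²/n(n² − 1), is the customary one and is not printed in this chapter; the function names and the five pairs of scores are hypothetical, and tied ranks are not handled.

```python
import math

def pearson_r(xs, ys):
    """Product-moment coefficient: sum of cross products / (n * sigma_x * sigma_y)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dev_x = [x - mean_x for x in xs]
    dev_y = [y - mean_y for y in ys]
    sum_cross = sum(a * b for a, b in zip(dev_x, dev_y))
    sigma_x = math.sqrt(sum(a * a for a in dev_x) / n)
    sigma_y = math.sqrt(sum(b * b for b in dev_y) / n)
    return sum_cross / (n * sigma_x * sigma_y)

def ranks(values):
    """Rank from 1 (lowest) upward; this sketch assumes there are no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def rank_order_rho(xs, ys):
    """Rank-order coefficient: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

# Hypothetical test scores and grade averages for five students.
test_scores = [40, 50, 55, 60, 70]
grade_averages = [2.0, 2.4, 2.3, 3.0, 3.4]
print(f"Pearson r = {pearson_r(test_scores, grade_averages):.2f}")
print(f"rho       = {rank_order_rho(test_scores, grade_averages):.2f}")
```

With so few cases either coefficient is of course highly unstable; the example is meant only to show the mechanics of computation.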


Application of Various Coefficients 


It would be out of place to discuss here the statistical derivation or technical aspects of these several coefficients, which are quite fully set forth by authorities already cited. In fact, η, ρ and C are seldom employed and even then usually become expressed after all in terms of r, through conversion into theoretical Pearson equivalents. This magical process is carried out by means of tables which estimate what r might have been if the data were in such form and distribution as to justify its direct calculation in the first place. Such speculative conversions seem rather unrealistic. Even the best “corrective” adjustment will not produce roses from a bed of statistical weeds. To determine some other coefficient, and then convert this (perhaps for necessary purposes of comparison) into a theoretical r value, often merely substitutes one set of questionable assumptions for another. Unless the data are flagrantly unsuited to treatment by the Pearsonian method (as with insufficient numbers or scale intervals, or with highly irregular or skewed distributions), it may be just as well to apply that formula directly, despite marked shortcomings in the situation, as to estimate it through some indirect approach. However, the shortcomings in question should be frankly acknowledged. For example, when the distribution of grades in school or college courses is limited by a marking system with not more than 5 to 7 intervals, that circumstance should be noted, since it definitely tends to lower the magnitude of correlations obtainable under such restrictions.

We do not presume or intend to disparage those refinements 
and techniques which, as indicated above, have been developed 
to measure co-relationship under special circumstances, where 
the product-moment Pearson formula may be inappropriate. As 
stated earlier, this assumes approximate normality of the two 
distributions being correlated, and their continuity throughout 
the entire scale, following a reasonably straight rather than 
curved “line of best fit” (relative correspondence).⁹

Guilford has covered these points as follows: “The correla- 
tion between two sets of psychological data is often so low that 
it is difficult to decide what type of relationship probably holds. 
The simplest assumption is a straight-line function, which as- 
sumption is usually made more often implicitly than explicitly. 
Nor do many psychologists stop to consider whether their
measuring scale is a linear one, possessing equal units throughout,
or whether it is logarithmic or some other type of scale,
with unequal numerical units. The literature is strewn with
coefficients of correlation that are absolutely meaningless and
useless when taken at their face value; worse still, they are actually
misleading because the source from which they came and
the assumptions underlying their computation are unknown or
ignored.” (Op. cit., 1936, p. 288.) To these cogent strictures
the present writer can only add a further dolorous note: that,
all too frequently, pertinent data as to the level and range of
ability represented in such reports are also inadequate or entirely
lacking.

While such criticisms with respect to many pseudo-scientific
reports seem warranted, it is virtually impossible to state just
where a line should be drawn between acceptable and inappropriate
conditions for application of the Pearson r formula.
Under actual working circumstances, these conditions are seldom
ideal. For example, although it is generally recommended
that around twenty, and at least fifteen, scale intervals be utilized
for this purpose, many academic marking systems, as
noted above, include only five to seven grades—such as A, B,
C, D, (E), F, (X). Provided their distributions are fairly normal,
derivation of a Pearson coefficient (even under these limiting
circumstances which tend to restrict its magnitude) is
likely to yield a better index of true relationship than other
methods will.

9. Although the problem of “curve fitting” is fundamental in statistical
research, space does not permit of its consideration in this limited compass. The
topic is well presented by Guilford (1936).


WHAT "r" INDICATES


The value of r is positive if the relationship between differ- 
ent variables, as generally applies among mental measures for 
example, is direct. When the relationship is inverse (as between 
age of entrance to college and academic standing—the younger 
students usually attaining superior marks), r is negative. In 
either case its relative significance for guidance, selection and 
predictive purposes cannot be simply and absolutely stated. 
This depends upon such factors as degree of homogeneity (rela- 
tive spread) within the group in question, reliability of the 
measurements themselves, other sampling factors as earlier mentioned,
and purposes for which the index is employed. Moreover,
r expresses a trigonometric function.¹⁰ Hence its significance
gains increasingly with its magnitude; i.e., an r of .70 has
much more than twice the significance or predictive value of one
around .35. So far as a general exposition of its relative values
can be made, the statement below, adapted from Garrett (1941,
p. 342), expresses briefly the accepted purport in educational
practice of a correlation between two variables.

r from .00 to .19 denotes indifferent or negligible relationship.

r from .20 to .39 denotes low correlation; present but slight.

r from .40 to .49 denotes a reasonable, and probably significant
correlation.

r from .50 to .69 denotes substantial or marked relationship.

r from .70 to 1.00 denotes high relationship, seldom found,
because of complicating factors and uncertain measures.
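
As a rough illustration only, the verbal scale above can be written as a small lookup. This sketch assumes that the bands apply to the size of r regardless of sign, and that a value falling between two printed limits (for instance .195) belongs to the lower band. The function name `describe_r` is hypothetical and is not drawn from Garrett.

```python
def describe_r(r: float) -> str:
    """Return the verbal label from the scale above for a coefficient r."""
    size = abs(r)  # assumption: the sign shows direction only, not strength
    if size > 1.0:
        raise ValueError("a correlation coefficient cannot exceed 1.00 in size")
    if size < 0.20:
        return "indifferent or negligible relationship"
    if size < 0.40:
        return "low correlation; present but slight"
    if size < 0.50:
        return "reasonable, and probably significant correlation"
    if size < 0.70:
        return "substantial or marked relationship"
    return "high relationship"


label_for_grades = describe_r(0.35)   # a coefficient in the .20 to .39 band
label_for_inverse = describe_r(-0.70) # an inverse relationship of the same size as .70
```

Such a lookup only restates the scale; as the text stresses, the practical weight of any r still depends on homogeneity of the group, reliability of the measures, and the purpose in view.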

Much higher coefficients are expected when reliability (the 
internal consistency of a measure, as later discussed) is at stake. 
If the reliability of a test, as determined by correlation between 
its parts or through successive administrations, is below .90,
considerable doubt arises as to the dependence which can be 
placed upon resultant scores (cf. Kelley, 1927, p. 29). It may 
be repeated that the foregoing comments relate particularly to 
educational measurements. What must necessarily be accepted 
as high correspondence in these areas might be regarded as low 
for psychophysical or other even more exact quantitative meth- 
ods. 


10. Tangent of the angle between a regression “line of best fit” for the cases
in question and a basic line of reference (ordinate or abscissa).

It may, for example, be of passing interest to note that perhaps
the highest correlation reported in statistical literature
is .9919. This was derived from experimental data first obtained
by Burt E. Holmes and subsequently reported by Croxton and
Cowden (1939, p. 652). Although this was a measurement task,
it was hardly the type of mental or aptitude measurement we
have been discussing. The population was composed of “110
. . . orthopterous insects,” more commonly known as crickets.
Correlation was between temperature in Fahrenheit degrees
and the number of chirps per minute for any given temperature
reading. Evidently there is an almost perfect relationship
between these two variables—i.e., the higher the temperature,
the more chirps! This represents an extreme degree of correspondence;
some educational measurements reach the other
extreme of practically complete independence, with r approximating
zero.
Additional remarks on the interpretation of correlated data
will be found later in this chapter, following supplementary
comment on the important questions of reliability and validity.
For further attention to the many-sided problems of correlation,
the reader is referred to such well-known statistical works
as have already been cited.


RELIABILITY 


Two basic standards by which any mental measurement—
whether test score, scholastic grade, personality rating or some
other index—is commonly appraised are its reliability (internal
agreement) and its validity (correspondence with external criteria).
The first of these, while of course not restricted to that
area alone, has particular importance for psychological testing.
It is an index of the consistency with which a measure performs
its ostensible function. In the physical sciences, reliability
of measurement has been highly perfected. Units of length,
weight, temperature, light or time can be determined so accurately
that the margin of error in observation is minute. In mental
measurements, however, because of variations in sampling
or inaccuracies of the instruments employed, it is rarely minute
and often gross.

Consider what would occur in the realm of physical science
if the most accurate time measure were a dollar watch without
[...] approximation to the standard
meter were some mercury in a glass tube. The watch might be
superior to a sundial; adjustments and corrections could be
made for the effect of temperature upon the column of mercury
(assuming that accurate thermometers were available!); but
any given measure of time or length would be almost comically
—or tragically—inexact according to present standards. Yet
in the field of mental testing, many instruments employed for
guidance are relatively less accurate and consistent in operation
from day to day than a child’s first watch.

Hence the importance of determining so far as possible a
test’s reliability, or functional consistency. This may be attempted
by several means, such as repeating it after an interval
with the same subjects, administration of parallel forms
(equated through previous trial) or estimation from “split-halves.”
These all have certain drawbacks: the first two methods
present special difficulties in assuring true comparability
between successive trials. The third procedure is that most often
employed; it usually consists of correlating the score made on
odd-numbered with that on even-numbered questions. If the
[...] cumulative score made by any
[...] should closely correspond with
[...] did so without exception, the test reliability would be perfect.
Actually, when questions are discriminating, they present varying
degrees of difficulty, not just in general but specifically.
Therefore absolute correspondence between a series of [...]
measures is seldom found; the patient's temperature as successively
taken by different thermometers (item after item) seems
to fluctuate somewhat. Not only do his own reactions vary, but
the separate instruments themselves are not ideally standardized,
despite the best of care in their construction.
Statistical reliability of a test is in part a function of its
length.¹¹ If the items are of reasonably equal value, the
more questions asked, the wider and surer becomes any sampling
of whatever it is intended, as a whole, to appraise. Correlation
between odd- and even-numbered items [...]
such as either half might alone, and probably lower than prevails
throughout its full length. By application of the “Spearman-Brown
prophecy formula” it is possible to estimate true reliability
for the total test through utilizing data obtained directly
by correlation between its split-halves. While the latter is recognized
as having been lowered from its true value by thus dividing
the test, the Spearman-Brown formula may in turn somewhat
overcorrect for this foreshortening, although Thurstone
(1937) cites evidence to the contrary for certain situations.
Since these do not always prevail, and even a mathematically
sound proof does not necessarily make allowance for subjective,
human eccentricities, it is probable that true reliability is, in
general, somewhat less than indicated by this method.

11. See statistical references cited earlier—Guilford (1936), Kelley (1924),
Lindquist (1940b), Peters and Van Voorhis [...] (1941) [...]
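
The odd-even procedure and the Spearman-Brown step-up can be sketched as follows. The text does not print the formula itself; this sketch assumes its standard form for a test doubled in length, r_full = 2 r_half / (1 + r_half). The helper names and the tiny set of marks are invented purely for illustration.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_brown(r_part, factor=2.0):
    """Estimated reliability of a test `factor` times as long as the part."""
    return factor * r_part / (1.0 + (factor - 1.0) * r_part)


def split_half_reliability(item_marks):
    """item_marks: one list of 0/1 marks per examinee, in question order."""
    odd = [sum(row[0::2]) for row in item_marks]   # questions 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_marks]  # questions 2, 4, 6, ...
    r_half = pearson_r(odd, even)
    return r_half, spearman_brown(r_half)


# Invented marks for four examinees on a four-question test.
marks = [[1, 1, 1, 1], [1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 0, 0]]
r_half, r_full = split_half_reliability(marks)
```

The stepped-up value always exceeds the half-test correlation when the latter is positive, which is why the text's caution about possible overcorrection applies to the second figure and not the first.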

In a speeded test, for example, spuriously high reliability
may be found if the odd-even technique is used. This occurs
when mere speed of response, rather than ability or aptitude in
the particular function ostensibly being measured, dominates
individual performance. When separate questions are relatively
difficult for the persons answering them, and the test is graded
upward in this respect, it becomes more of a power and less of a
sheer speed measure. With sufficient time allowance, each subject
presumably will then find his maximum level of attainment
(score). Under those circumstances, split-half reliabilities are
much less likely to be spuriously raised than in the more usual
case where conditions put a premium upon rapidity of performance.

Several factors should be kept in mind when reliability is considered.
This is not a static or immutable characteristic of any
test. It may be affected by the conditions of administration, the
ability levels of various groups measured, their familiarity with
analogous testing procedures or mechanics (such as the use of
separate answer sheets), etc. Hence what is often referred to as
the reliability (as if it were some fixed characteristic) of a measure,
determined under one administrative schedule or from one
group of students, may fluctuate considerably throughout successive
administrations, despite no change whatever in the test
itself. Thus its observed internal consistency, as originally computed,
may rise or fall on subsequent occasions (e.g., if the
time allowances are respectively diminished or increased).

The other factors just mentioned also tend variously to affect
obtained results and therefore to make estimates of true
test reliability (however determined) less stable than one might
suspect from certain rather glib statements prevalent in psychometric
literature. As Richardson (1940) has pointed out, “precision
at the moment of measurement,” though continually
sought in the more exact sciences, is seldom recognized even as
an essential desideratum in mental testing. This [...]
is doubly unfortunate because the organism examined is [...]
subject to the influences of growth, differential training and
other factors making for progressive internal variation.

Attention has been directed to the split-half method, com- 
plete with Spearman-Brown correction for estimating the re- 
liability of objective tests. Because it is the one heretofore most 
commonly employed and therefore almost exclusively repre- 
sented throughout the various differential-measurement experi- 
ments, we shall consider it in due course. Much criticism has 
recently been leveled at this procedure for reasons already men- 
tioned and others too abstruse for discussion here. 


[...] score distributions.¹² In [...] Richardson (1940) briefly discusses
[...]tion of reliability” and states: “[...]tory underestimate of the
re[...] mean, standard deviation and the number of items. It is
submitted that the latter short-cut method yields sufficiently accurate
and informative results for all routine testing purposes.” (Idem, p. 15.)

12. This short-cut formula is:

$$r_{tt} = \frac{n}{n-1} \cdot \frac{\sigma_t^{2} - n\,p\,q}{\sigma_t^{2}}$$

The symbols represented are: r, the reliability coefficient; n, the number of
items in the test; $\sigma_t$, the standard deviation of test scores; p, the quotient
obtained when the mean score is divided by the number of items
(i.e., p = M/n, where M is the mean score, and q = 1 − p).

In a footnote, the authors comment on their newly proposed formulas: “It
should be noted that, in the authors’ opinion, these formulas apply no better
to time-limit tests than does the split-test Spearman-Brown method.” (Idem,
1939, p. 681.)

Excellent em[...] latter point is afforded by “The
Graduate Record Examination Technical Handbook” (Graduate Record
Ex[...]) [...] distributions of reliability coefficients
[...]-half and Kuder-Richardson methods
are compared. Except for the mathe[...]ination, the corresponding
[...]imal, despite the fact that
the Kuder-Richardson coefficients were [...] a population of 3,990 [...]
[...] cases chosen at random from the larger population. [...]
it is so graded in difficulty that it [...]
other words, those who are not [...]
[...]ension and perform before the
time runs out. Richardson would probably believe his more conservative
coefficient (.878) superior to the Spearman-Brown .930 in this situation.
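
A short-cut estimate of the kind described in footnote 12 can be sketched as below. The printed formula is badly damaged in this copy, so the sketch assumes the form usually cited as Kuder-Richardson formula 21, which needs only the number of items, the mean and the standard deviation. The function name `kr21` and the figures in the example are hypothetical.

```python
def kr21(n_items: int, mean_score: float, sd_score: float) -> float:
    """Short-cut reliability estimate from item count, mean and SD of scores.

    Assumed form (Kuder-Richardson formula 21):
        r = n/(n-1) * (sd^2 - n*p*q) / sd^2,  with p = mean/n and q = 1 - p.
    """
    if n_items < 2 or sd_score <= 0:
        raise ValueError("need at least two items and a positive SD")
    p = mean_score / n_items          # average proportion of items passed
    q = 1.0 - p
    variance = sd_score ** 2
    return (n_items / (n_items - 1)) * (variance - n_items * p * q) / variance


# Invented example: a 100-item test with mean 50 and standard deviation 10.
estimate = kr21(100, 50.0, 10.0)
```

Because the formula treats all items as equally difficult, it is generally taken to give a conservative figure, which accords with the comparison of .878 and .930 mentioned in the footnote.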

The chief question here is: if split-half reliability coefficients
are themselves somewhat unstable at best, why should one bother
with them at all? Hence Richardson and Kuder's proposal of a
short-cut method and an expressed preference for “reporting
and interpreting reliability coefficients . . . in terms of percentage
accuracy.” Flanagan, commenting upon this paper,
states: “A reliability coefficient is a summary abstraction
which I do not believe should be encouraged. I should like to
see the standard error of measurement¹³ used to a much greater
extent than the reliability coefficient.” (Idem, 1940, p. 16.)
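
Flanagan's preferred index can be illustrated in a few lines. The sketch follows the formula given in footnote 13, namely the standard deviation of test scores multiplied by the square root of one minus the reliability coefficient. The function name and the sample figures are hypothetical.

```python
import math


def standard_error_of_measurement(sd_scores: float, reliability: float) -> float:
    """Standard error of measurement: sd * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd_scores * math.sqrt(1.0 - reliability)


# Invented example: scores with a standard deviation of 10 points.
sem_at_91 = standard_error_of_measurement(10.0, 0.91)  # about 3 score points
sem_at_75 = standard_error_of_measurement(10.0, 0.75)  # 5 score points
```

Expressed this way, a reliability of .91 still leaves an uncertainty of roughly three points in either direction around an obtained score when the spread of scores is ten points, which is the practical information a bare coefficient conceals.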


Sporadic quotations without their full context may often be
misleading. However, we consider that which follows, from this
same paper by Richardson, not unfair to cite even without reproducing
his introductory remarks: “All test items of a cognitive
nature, whether of ‘aptitude’ or achievement, are positively
correlated within an experimental population of the same general
culture. A few easy calculations suffice to show that it would
be very difficult to construct a test of 150 items with a reliability
coefficient below 0.90. No tyros in the game of test construction
could possibly attain a coefficient below 0.90, except by mis-keying
the items.” (Idem, 1940, p. 18.)

To the present writer, that sweeping generalization seems
extraordinary, to say the least, and quite unsubstantiated from
the experience of many earnest, harassed investigators. Reliability
coefficients even of .90 do not just grow on trees, to be
plucked without effort. Although such measures are handicapped
in operation by their constitutional instability, some
may nevertheless serve a useful discriminatory purpose. Richardson’s
comment as to the virtual impossibility of obtaining
reliabilities under .90 from any hodgepodge of 150 objective
test items must have been primarily directed toward general
intelligence tests. Certainly his statement does not apply to
those of the differential type.
Remmers (1940, 1941, 1942) and certain of his students in
a series of carefully planned experiments have demonstrated
empirically that the Spearman-Brown prophecy formula does
operate effectively in most objective-testing situations. Consequently,
recent theoretical strictures against its application
seem unwarranted in any absolute sense, although (as earlier
noted) both the split-half technique in general and its Spearman-Brown
correction in particular may be inappropriate in
certain situations, which do not meet the assumptions on which
these procedures are based. These should not be endowed with
a presumption of universal carry-over.

13. The standard error of measurement formula is $\sigma_{meas} = \sigma_t\sqrt{1 - r}$. In
this formula, $\sigma_t$ refers to the standard deviation of the distribution of test
scores and r refers to the reliability coefficient (whether derived from a retesting
or from the Spearman-Brown formula).

Coefficients of reliability, when properly determined by any
legitimate means, are of fundamental importance in rating the
operational consistency of a measure. Roughly, this attribute
may be described as “the steadiness with which a test keeps on
doing its particular job.” High statistical reliability does not
guarantee the specific value of an instrument; but low reliability
certainly challenges its utility for any purpose. Reliability,
though highly important, has indeed more negative than
positive significance. Determination thereof should perhaps be
regarded as providing through its own standard error an index
of relative unreliability or “tolerance limit” by which the immediate
accuracy, or what Richardson terms “precision at the
[...] reliabilities, they are apt to forget
[...] is less than satisfactory. [...]
again to physical standards, we should be quite disturbed if
several different yardsticks or weights, when successively ap[...]
[...]ed respective indices correlating
[...]; yet such coefficients represent
[...] mental testing situations, even
[...] speed of response. It must be re[...]
[...] in actual performance of the individuals
tested, however, rather than faults of the measuring
instruments themselves, may account for a large part of such
discrepancies.

[...] probable [...] the
confidence level with which it [...] or standard error,
[...] reliability of the respective measures. A “significance” table
[...] which of course is never attained. If the reli[...]
being compared is uniformly high (e.g., .9[...]
[...] variance between total scores and odd- versus even-[...]
[...] for example, can probably be ignored.
[...] reliability is low for either test, another factor of
[...] in this case attributable not to sampling errors
but to the instrument itself—enters in to affect the problem of
significance.

[...] most aptitude and achievement measures of the
[...] type in current use do have satisfactory reliabilities.
[...] writer recently encountered a somewhat unusual situation,
[...] in which that was not true. Two parts of a true-false
[...] Bar examination respectively tested somewhat different
[...] of “Substantive Law.” Each contained 75 questions
[...] Spearman-Brown reliability coefficients of .65 and .68 respectively.
For a group of 1,000 candidates, the two parts correlated
.36—a coefficient which, under the circumstances, appears
surprisingly low. For special purposes, a more detailed
analysis was later made of 300 cases selected at random from
within this same population (i.e., 300 anonymously chosen by
serial numbers out of the original 1,000). Although both the
mean and the standard deviation of this sample agreed very
closely with those for the entire group from which it had been
drawn, a corresponding r of .51 resulted.

By the usual standards of appraisal (including Fisher's
method) the difference between these two coefficients, being
[...] would be regarded as significant
[...] yet the facts belie that
[...]sis which seems reasonable in accounting for this
“ingenious paradox” is that low reliabilities within each part of
the examination account for the discrepancy [...]
intercorrelation might perhaps dip to around [...]
sample of 300 cases. Our main point is that evaluation of correlational
significance by any means should take proper account
of internal reliability. No systematic method for doing
that has, so far as the writer knows, been developed.¹⁴

The fact that a theoretical “correction for attenuation” can
be applied does not meet this need, as it merely represents a sort
of “let's pretend” game. Thus, to declare a resultant difference
unimportant on the assumption that it would be, if the tests in
question did have satisfactory reliabilities, may “explain” a
discrepancy but does not remove the underlying source of error.
Accordingly, for practical counseling or other measurement
objectives, unreliability should always be regarded as introducing
a grave problem of interpretation not to be exorcised
by statistical necromancy. No generalization should overlook
an instance of glaringly low internal consistency.

14. These data are from a confidential unpublished report. The particular examiners
for whom this analysis was made are among the ablest and most experienced
in their field—no “tyros” they!


VALIDITY 


A seemingly high degree of psychometric consistency may 
reveal little as to a measure’s validity, or practical usefulness. 
It might, statistically, appear quite reliable (through restric- 
tion of sampling errors), yet not necessarily be valid; in other 
words, mere self-agreement within a test does not establish its 
external significance. Otherwise we could, for example, employ 
height or weight (because they can both be measured with 
great reliability) as criteria of promotion in school. Although 
grades are less accurately determinate (reliable) than physical 
measurements, they are more meaningful (valid for that pur- 
pose). Nevertheless, while validity may show considerable fluc- 
tuation, even among instruments of equal reliability, the latter 
index generally fixes the upper limit of correspondence with 
any outside criterion. Under normally realistic conditions, and 
except for possible errors or accidents of measurement, validity 
cannot exceed reliability—i.e., a test can hardly correlate 
higher with external standards than it does within itself. 

The most direct gauge of a test’s practical usefulness, obvi- 
ously, is how well it succeeds in particular measurement pur- 
poses. To repeat: reliability is an internal function (self-cor- 
relation within a test), while validity has external reference 
(correlation with independent, outside standards). The relia- 
bility of prognostic instruments, and of subsequent criterion 
performance alike, determines the probable limits of relation- 
ship between them but has no positive Significance. Validity 
does. 

Some difficulties in estimating true reliability of tests have
already been noted; still greater are those encountered in determining
validity. Here not only merits of the instrument itself
are involved, but also its relationship to measures (such as average
grades) of variant, and often of questionable, dependability.
Moreover, other factors—effort, attention, state of mind or
body, distractions—may affect individual performance in any
given hour, and obviously still more over a considerable period
of time, such as a school year. In these respects, human behavior
may indeed be termed “the functions of a complex variable.”
Interplay of multitudinous forces, some relatively independent
and others related, raises a host of problems with respect to
determining the most probable validity of any predictive instrument.

Standard Error of Estimate


In connection with the use of correlation for prediction purposes
one often sees the so-called “standard error of estimate”
cited (cf. Peters and Van Voorhis, 1940, pp. 112-115). This is
simply another application of the standard deviation, previously
discussed. The standard error of estimate (written $\sigma_{est}$)
is a measure indicating the degree of dispersion among differences
(or errors) between forecasts and the respective criterion
scores actually obtained. Its similarity to the standard deviation
($\sigma$) is recognized when we recall that $\sigma$ is a measure of the
dispersion or spread of values around the average of a distribution.
The formula is: $\sigma_{est} = \sigma\sqrt{1 - r^{2}}$. In this formula $\sigma$
is the standard deviation of criterion scores; r is the correlation
between a given test and its criterion. Thus we see that if r is
1.00, the equation becomes $\sigma_{est} = \sigma\sqrt{1 - 1} = 0$. In that impossible
case of perfect correlation (even with crickets!) the quantity
for one variable, as estimated from the other, would have no
error. Conversely, when r = .00, the error of prediction reaches
its maximum and becomes a matter of pure chance, limited only
by the range of criterion scores. Practical working conditions
naturally fall between these two extremes. The standard error of
estimate offers, among other possibilities, a comparatively precise
means of determining how useful the correlation of one or
more predictive measures (singly or in combination) will prove
in terms of forecasting efficiency.
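
The behavior of the standard error of estimate between its two extremes may be easier to see in a small computation. This sketch applies the formula stated above, the standard deviation of criterion scores multiplied by the square root of one minus r squared. The function name and the criterion spread of 10 points are hypothetical.

```python
import math


def standard_error_of_estimate(sd_criterion: float, r: float) -> float:
    """Spread of criterion scores around their forecasts: sd * sqrt(1 - r^2)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    return sd_criterion * math.sqrt(1.0 - r * r)


# Invented example: criterion scores with a standard deviation of 10 points.
errors_by_r = {r: standard_error_of_estimate(10.0, r) for r in (0.0, 0.3, 0.6, 1.0)}
```

With r = .60 the error of forecast is still eight-tenths of what it would be with no correlation at all, which shows how slowly forecasting error shrinks as coefficients rise through the middle range.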

Standard Error of the Correlation Coefficient

Lest possible confusion arise in interpretation of the standard
error of a correlation coefficient ($\sigma_r$) as contrasted with the
standard error of estimate ($\sigma_{est}$), it may be well to compare the
two. The standard error of a correlation coefficient is frequently
used to determine whether the obtained correlation, positive or
negative, has actuality in some degree. This alone affords little
information as to how useful the correlation may be for practical
purposes. The standard error of a .00 correlation based
on 100 cases is .10 (i.e., $\sigma_r = 1/\sqrt{100} = .10$). In a large
population where the true correlation between two factors is
.00, chance alone could produce samples of 100 cases each in
which the obtained correlation might range from −.20 to +.20
and in very exceptional instances from −.30 to +.30. Therefore
we have no certain assurance that a .15 r (previously
found in a sample of 100 cases) represents more than a chance
finding. However, if our sample had yielded r = .50, we could
feel quite certain that so large a coefficient could not result
solely from the operation of chance factors. In other words we
would regard this .50 correlation as significant. Peters and Van
Voorhis (1940, pp. 119-120) describe the significance of a
correlation in relation to its standard error as follows:

“If the sample is reasonably large and the ratio of the r to its
standard error is 1, the odds are about 5 to 1 that there is a
true r between the two sets of variables somewhat above zero in
the same direction as that of the sample; if the ratio is 2, the
odds are about 43 to 1; if 3, the odds are about 740 to 1; if 4,
about 32,000 to 1; etc. Moreover, if several successive samplings
give r's with the same sign, the probability that there is
a true correlation with that sign is greatly increased.”

We are usually concerned with the standard error of estimate
only in connection with correlations which are themselves significant.
After this question of significance has been settled, we
may be interested in how well correlation procedure can be used
for purposes of prediction. It is in answering this latter question
that the standard error of estimate formula is helpful.
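
The figures in this discussion can be checked with a brief computation. The sketch makes two assumptions that the text does not state outright: that the standard error of r follows the older textbook form (1 − r²)/√N, which reduces to 1/√N when r is .00 as in the example of 100 cases, and that the quoted odds come from the one-tailed area of the normal curve. Both function names are hypothetical.

```python
import math


def standard_error_of_r(r: float, n_cases: int) -> float:
    """Assumed classical form: (1 - r^2) / sqrt(N)."""
    return (1.0 - r * r) / math.sqrt(n_cases)


def odds_of_true_r_same_sign(ratio: float) -> float:
    """Odds that the true r shares the sample's sign, given r / SE = ratio.

    Assumes a normal sampling distribution and a one-tailed area.
    """
    tail = 0.5 * math.erfc(ratio / math.sqrt(2.0))  # area beyond `ratio` sigmas
    return (1.0 - tail) / tail


se_zero_r = standard_error_of_r(0.0, 100)          # the text's example of 100 cases
odds_table = {k: odds_of_true_r_same_sign(k) for k in (1, 2, 3)}
```

Under these assumptions the ratios of 1, 2 and 3 give odds close to 5, 43 and 740 to 1, in line with the passage quoted from Peters and Van Voorhis.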


Adjustment for Restriction in Range 

Factors of selection [...]
and other conditions [...]
test score with grades [...]
tendency toward agree[...]
diminution of coefficients through such causes.

One procedure of that sort involves adjustment for “restriction
in range.” Correlation coefficients [are lowered when the]
spread of ability among individuals is restricted; selective admission
to college by means of entrance examinations or intelligence
tests exemplifies this fact. Elimination of the weaker
candidates removes them from subsequent competition with
those of higher ranking. The desirable aim of that process is,
of course, to reduce scholastic mortality among the admitted
group; but exclusion of the least promising applicants, while
lowering the proportion of failures in college, also reduces the
degree of correspondence later to be found between whatever
measure has been utilized for selection and subsequent performance
among students accepted. The more effective any procedure
is in this respect, and the more reliance is placed thereon
in a selective process, the less highly will it correlate with later
achievement. Fulfilling its purpose tends, paradoxically
enough, to rule out the strongest possible evidence of its own
value.
Adjustment for restriction in range attempts to estimate 


what the correlation between two measures would be if either or 


both of them had a wider spread than is actually present (i.e., 
"4 the circumstances just discussed, if all applicants—including 
Ose most likely to fail—were admitted to college). In this 
connection, Peters and Van Voorhis (1940, pp. 209-212) dis- 
cuss two formulas originally proposed by Kelley. The first, 


referred to as formula (129), is

$$\frac{\sigma}{\Sigma} = \frac{\sqrt{1 - R_{11}}}{\sqrt{1 - r_{11}}}$$

and was developed in connection with reliability coefficients.
The other,

$$\frac{\sigma}{\Sigma} = \frac{\sqrt{1 - R_{11}^{2}}}{\sqrt{1 - r_{11}^{2}}}$$

referred to as formula (131), is customarily used in adjusting
validity coefficients or so-called “interfunction correlations.”¹⁵
Formula (131) has frequently been criticized for producing
extreme adjustments, particularly in connection with low in-
itial coefficients. Peters and Van Voorhis present evidence sug-
gesting that both formulas are about equally suitable. From a
graph of these functions, plotting actual values in a series, it
becomes apparent that the former provides the means of ad-
justment most in keeping with what practical experience seems
to warrant. Since it conservatively estimates the effect of re-
striction (or conversely of heterogeneity) upon obtained co-
efficients, formula (129) has been used with certain data we
shall later report. Furthermore, in order to avoid the absurd
effects sometimes produced in correction of very low coefficients,
only those of .20 or larger have thus been adjusted. These
comments refer to certain tables which will be presented later,
in Part II.

15. Symbols used in formulas adjusting correlation coefficients for the effect
of restriction in range are: $\sigma$ and $\Sigma$, the standard deviations of the restricted
and unrestricted distributions, respectively; $r_{11}$ and $R_{11}$, the correlation co-
efficients in the restricted and unrestricted populations, respectively.
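
The two adjustments can be compared numerically. What follows is a minimal Python sketch, not the writer's own procedure. It assumes the two Kelley formulas are read as σ/Σ = √(1 − R)/√(1 − r) for the reliability-type adjustment and σ/Σ = √(1 − R²)/√(1 − r²) for the validity-type adjustment, and it solves each for R. The function names and the sample values (r = .50, σ/Σ = .80) are illustrative assumptions only.

```python
import math


def adjust_reliability_type(r, sd_restricted, sd_unrestricted):
    """First (reliability-type) formula solved for R: R = 1 - (s/S)^2 * (1 - r)."""
    ratio_sq = (sd_restricted / sd_unrestricted) ** 2
    return 1.0 - ratio_sq * (1.0 - r)


def adjust_validity_type(r, sd_restricted, sd_unrestricted):
    """Second (validity-type) formula solved for R: R = sqrt(1 - (s/S)^2 * (1 - r^2))."""
    ratio_sq = (sd_restricted / sd_unrestricted) ** 2
    return math.sqrt(1.0 - ratio_sq * (1.0 - r * r))


# Hypothetical case: obtained r of .50 in a group whose spread is
# 80 per cent of the spread in the unrestricted population.
first = adjust_reliability_type(0.50, 0.8, 1.0)
second = adjust_validity_type(0.50, 0.8, 1.0)
print(f"first formula:  {first:.3f}")
print(f"second formula: {second:.3f}")
```

In this example the first formula yields the smaller adjusted value, which is consistent with the remark that it estimates the effect of restriction more conservatively.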


Attenuation 


Another correction sometimes employed is that for attenua- 
tion. This refers to the reliability of associated measures; it 
indicates what the correlation between them might be if each, 
for example, had perfect reliability. Under practical working 
conditions, a coefficient of about .65 between grades and differ- 
entially prognostic test scores may be regarded as quite high. 
Where their reliabilities are respectively .70 and .90, correc- 
tion for attenuation would theoretically raise that correlation 
to .82. Even were this adjustment applied to the criterion
(grades) alone, granting the test itself no corresponding
benefit, the adjusted r would be .77 (cf. p. 59).
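
The arithmetic of this example can be checked with the usual correction-for-attenuation formula, in which the obtained r is divided by the square root of the product of the two reliabilities. The sketch below is illustrative only: the function name is an assumption, and the formula is the standard textbook one rather than anything quoted from this passage.

```python
import math


def correct_for_attenuation(r_xy, rel_x=1.0, rel_y=1.0):
    """Estimate the correlation if the measures had perfect reliability.

    A reliability left at 1.0 means no correction is applied for that measure.
    """
    return r_xy / math.sqrt(rel_x * rel_y)


# Values from the example: r = .65, test reliability .90, grade reliability .70.
both = correct_for_attenuation(0.65, rel_x=0.90, rel_y=0.70)
criterion_only = correct_for_attenuation(0.65, rel_y=0.70)
print(f"corrected for both measures:  {both:.3f}")
print(f"corrected for criterion only: {criterion_only:.3f}")
```

Correcting for both reliabilities gives about .82, as stated. Correcting for the criterion alone gives about .777, which the text reports as .77.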

Such corrections may serve a useful purpose, as for example 
in comparing the relative efficiency of tests administered to 
groups of unequal ability, or correlated against different cri-
teria. For practical guidance, however, the validity of predic-
tive measures must be appraised in terms of their actual set-
ting. Correction for errors of measurement and sampling (low
reliabilities whether in the test itself or in the criterion) has al- 


Coefficient of Alienation 
In the earlier remarks on correlation, a table reproduced
from Garrett indicated the relative significance of coefficient
magnitudes. By rigorous and sanctimoniously pure statistical
standards, correlations from .50 to .60 [seem]
rather insignificant. It has become quite fashionable to employ
the “coefficient of alienation” k, whose value is $\sqrt{1 - r^{2}}$,
in [...] of test validation in the educational field. While k is a
cautionary index, it has at times been overemphasized,
especially by those who take for granted the loose statement
sometimes made that “k measures the absence of relationship
just as r measures its presence.” (Guilford, 1936, p. 362.)
What k does represent, according to its parent, Truman Kel-
ley (191[?], pp. 59-61), is the degree to which all other factors
than the test itself correlate with its criterion.
This is something quite different from measuring “absence
of relationship,” which seems analogous to measuring “absence
of light.” There can be variable degrees of illumination or
of [correlation], so long as any light prevails; but its absence (or that
of [relationship]) suggests a complete obscurity below the thresh-
old of meaningful observation.
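
To see why k makes moderate correlations look weak, the index can be tabulated for a few values of r. This is a small illustrative sketch; the function name is an assumption, and the formula is the one given above, k = √(1 − r²).

```python
import math


def coefficient_of_alienation(r):
    """Kelley's k: the square root of one minus the squared correlation."""
    return math.sqrt(1.0 - r * r)


for r in (0.50, 0.60, 0.80, 0.95):
    print(f"r = {r:.2f}   k = {coefficient_of_alienation(r):.3f}")
```

For r = .60 this gives k = .80, so a correlation that is quite respectable in practice appears unimpressive when judged by k alone.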
VALIDATION AND CRITERION RELIABILITY

Bearing directly upon the problem of validity coefficients is
this one factor, to which frequent reference is made in
this volume, that reliability of any given criterion (e.g., in-
dividual course grades or averages, the personal judgment of
[persons conducting] an interview, etc.) limits the possible correla-
tions therewith. One cannot predict relative performance of any
[kind] unless it can be dependably appraised; yet that obvious
[fact is ove]rlooked when the shortcomings of
objective guidance measures or [...].
[Well-cons]tructed achievement, general in-
telligence or differential aptitude tests alike are substantially
more consistent in what they measure (i.e., more reliable) than
are the usual indices of scholastic ranking which those tests are
intended to predict. A major bane of educational prognosis is
the comparatively low dependence which can be placed upon
such criteria as marks, whether in high school, college or gradu-
ate studies. One recent major aid in this respect is the increas-
ing use of objective tests as in[...].
These, as earlier noted, have their [...]
measure is usually appraised with considerable ac-
curacy. The chief factors which tend to reduce stability of educa-
tional criteria are variations, first in students’ actual perform-
ance and second in subjective estimation thereof by instruc-
tors. However carefully the latter report their respective
grades, these two influences both operate (often for good cause)
to lower the statistical reliability of marking. Throughout
many schools and colleges, and for different situations or sub-
jects of study, this (in terms of correlation between grades in
successive years, semesters or quarters) has tended to range
between .60 and .80. The latter coefficient would rank quite
high in the realm of grading; around .70 represents a more
generally typical index of consistency prevailing among class-
room marks, for the same persons in the same subject, at school
or college.

The difficulty of obtaining even immediate (to say nothing
of long-range) consistency among achievement measures is il-
lustrated by data from Traxler’s (1943) recent discussion of
the Cooperative American History Test. His report mentions
the New York Times survey of high school students’ acquaint-
ance with our national history and geography. It also refers to
the Cooperative Test just cited as “one of the more widely used”
measures in this field.

However, correlations of scores thereon with school grades
in two analogous (twelfth-grade) American history courses 
were under .60. Traxler states: “The relationship appears to 
be a little lower than it usually is between scores on an achieve- 
ment test and marks in the same subject. In a study reported 
by the [Educational Records] Bureau in 1937, the median of 
121 correlations between school marks and scores on Coopera-
tive Achievement Tests was .72. However this earlier correla-
tion was probably a little too high because of the fact that the
tests were being used to some extent as a basis of marking in a
few of the schools which contributed data for that study.” 
(Traxler, 1943, p. 47.) 

Reliability of these objective tests (as frequently determined 
for different groups and levels) normally exceeds .90 or .95. 
The coefficient reported by Traxler for this Cooperative Amer- 
ican History Test is .96 (predicted from split-half correlations 
by the Spearman-Brown formula). It would of course be quite 
possible for two measures of high reliability to correlate but 
little with each other; that would simply indicate that each 
tested, with consistent results, a separate mental function. It is,
however, difficult to understand how school grades and achieve- 
ment test scores in a particular subject, such as American his- 
tory for example, could correlate under .60 unless one or both 
indices have low internal reliability. In most situations of this 
sort, academic grades probably offer the less stable criterion. 
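
The Spearman-Brown step mentioned above can be sketched as follows. The formula used is the standard one for a test of doubled length, 2r/(1 + r). The half-test correlation of .923 is a hypothetical value, chosen only because it steps up to about .96; it is not a figure reported by Traxler.

```python
def spearman_brown(r_half):
    """Predict full-length reliability from the correlation between two halves."""
    return 2.0 * r_half / (1.0 + r_half)


# Hypothetical split-half correlation (assumed for illustration).
full_length = spearman_brown(0.923)
print(f"predicted full-length reliability: {full_length:.3f}")
```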


16. Data are reported in Chapter IV (p. 133) indicating the range of such
coefficients for Yale freshmen in one year as .53 to .85, with a median of .78. 




We do not mean to suggest that school or college grades
should be dispensed with as unsatisfactory and replaced by ob-
jective test scores merely because the latter are more statisti-
cally consistent. We do, however, intend definitely to emphasize
the attenuating effect upon all prognostic correlations of low
reliability for either “variable” (a term almost ironically ap-
propriate to many school and college marks). It can hardly be
too much reiterated that prediction of future academic success
(whether from earlier academic records, entrance examinations,
achievement or aptitude tests) in any field is limited by the
delicacy (i.e., discriminating power) of achievement criteria.
For that reason, even prognostic tests with individual reliabil-
ities of .90 or even .95, through no fault of their own, can
scarcely be expected to correlate higher on the average than .70
with usual grades in course. In short, that is “par.”

We have referred to correction for attenuation in adjusting
for unreliability of a criterion (e.g., a student’s grade in Eng-
lish or chemistry) as akin to lifting oneself by his bootstraps. In
a practical testing situation, the betting odds on a certain fore-
cast of performance, or a counseling recommendation, should
be determined realistically. Whatever vagaries of marking are
inherent under the circumstances must be accepted; regrettably
perhaps, but nonetheless unavoidably. Yet it should be realized
that the chances of success in differential prediction or guid-
ance can be enhanced as grades (or other performance criteria)
become more dependable. We cannot defend increasing the odds
in favor of test results by a theoretical assumption as to what
the probabilities might be if our basic measurements were only
more accurate. Yet it is reasonable to point out that prognostic
tests would do a better job if the means of evaluating subse-
quent performance were sharpened. Whatever shortcomings
such instruments may have, they are not the only transgressors.
In fact their sins might be washed away to a notable degree if
greater reliability of the criterion measures provided more soap.
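
A rough way to see how criterion reliability caps validity is the classical test-theory bound, under which an observed correlation cannot exceed the square root of the product of the two reliabilities. That bound is an assumption brought in here for illustration; it is not a formula stated in the passage, and the function name and reliability values are likewise hypothetical.

```python
import math


def validity_ceiling(test_reliability, criterion_reliability):
    """Upper limit on an observed correlation, assuming classical test theory."""
    return math.sqrt(test_reliability * criterion_reliability)


# A test of reliability .95 against grades of varying dependability.
for grade_reliability in (0.60, 0.70, 0.80):
    ceiling = validity_ceiling(0.95, grade_reliability)
    print(f"grade reliability {grade_reliability:.2f}: ceiling {ceiling:.3f}")
```

The ceiling is reached only if the two measures tap exactly the same function, so correlations obtained in practice fall below it. The sketch shows only that the limit rises as grades become more dependable.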

DISTRIBUTION OF CORRELATION COEFFICIENTS 
It has been emphasized by Fisher (1938), Lindquist (1940b)
et al. that the long-accepted method of appraising the authen-
ticity of correlation coefficients in terms of their standard or
probable errors under certain conditions is open to serious ques-
tion. Several authoritative books on statistical procedures con-
tain a table giving probable errors of r as determined merely
from the number of cases and magnitude of the coefficient. This
provides a convenient short-cut by which any investigator may
ascertain, without bother of computation, the probable error of
a correlation. If the coefficient is .65, for example, and N =
300, one need only locate .65 at the top of a column and 300
down through another on the side to find a PE of .0225, all
very simple!
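
The tabled value can be reproduced from the conventional large-sample formula, PE = .6745(1 − r²)/√N. The sketch below is illustrative; the function name is an assumption, and the formula is the customary one rather than a quotation from any of the tables referred to.

```python
import math


def probable_error_of_r(r, n):
    """Conventional probable error of a correlation coefficient."""
    return 0.6745 * (1.0 - r * r) / math.sqrt(n)


pe = probable_error_of_r(0.65, 300)
print(f"PE for r = .65, N = 300: {pe:.4f}")
```

With r = .65 the quoted PE of .0225 corresponds to N = 300. The formula itself carries the normality assumptions discussed in the following paragraphs.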

The writer is no statistician and freely admits his amateur
standing in this respect. Yet he ventures to suggest that even 
some experts in the field have failed to stipulate the assumptions 
upon which a table of this sort is based. One is that the distribu- 
tion as a whole from which any sample (be it 30, 300, 3,000 or 
any other number of cases) may be drawn is theoretically both 
normal and infinite. There are, to be sure, various cautions rec- 
ommended in statistical texts with respect to checks on the 
normality of distributions, and tests for rectilinear versus curvi- 
linear relationships.

Fisher (1938) and others have stressed this common over-
sight in statistical procedure (use of the standard error or its
derivative, PE, when the number of cases is too small) and have
also demonstrated that the standard error is invalid for quite
high correlations. The standard error concept assumes a normal
variance on either side of the observed determination and es-
tablishes probability-limits within which a series of repeated
outcomes would be likely to fall. Common sense (substantiated
in this case by Fisher’s theoretical evidence) suggests that when
a coefficient is unusually large (over .90) the chances of its
true value being greater than that obtained are less than of its
being smaller. Under those circumstances the probability-limits
of error no longer can be expected to follow a normal distribu-
tion curve, but instead are compressed toward the upper limit
and skewed, by comparison, downward. Here is another reason
for not blithely accepting high test reliabilities at face value.
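
The skewness described here can be illustrated with Fisher's z transformation, which is one standard way of setting limits around a high correlation. The passage does not name that transformation, so its use below is an assumption, as are the function name, the sample size of 50 and the 1.96 multiplier.

```python
import math


def fisher_interval(r, n, z_crit=1.96):
    """Approximate limits for a correlation via Fisher's z = atanh(r)."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)  # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


r = 0.95
low, high = fisher_interval(r, 50)
print(f"limits: {low:.3f} to {high:.3f}")
print(f"room below r: {r - low:.3f}   room above r: {high - r:.3f}")
```

The interval extends farther below the obtained coefficient than above it, whereas limits computed for r = 0 are symmetrical.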

These remarks are perhaps out of place in a cursory review
of statistical principles. Few of the data cited later have been
[reported] in Fisher’s confidence level terms, since they antedate
[...] regarding significance; and furthermore
[the number of cases] involved is usually larger than would be
required by Fisher’s small-sample theory. Without further tech-
nicalities, it may simply be stated that the method of reporting
authenticity of correlation coefficients in confidence levels (e.g.,
as 5, 1 or fewer chances in 100 that they are not significant) is
certainly desirable in special circumstances. It is, however, not
generally superior to, or different from, long-established (criti-
cal ratio) standards, where assumptions underlying the latter
are valid. Indeed, Peters (1943) recently mentioned this point
in connection with prevalent misconceptions regarding Fisher’s
contributions to statistical methods.¹⁷


17. Peters also maintains that much of Fisher’s analysis of variance technique
is more directly appropriate to agricultural than to educational or psychologi-
cal experiments, [...] matched grouping of individuals is possible. For perti-
nent [discussion] of this [point], see the following additional references:
Johnson (1943), Garrett (1943a), and Hotelling (1943).


ACADEMIC PREDICTION


The forecasting of educable promise, whether general or
directional, is an important aim of most grading and examining
processes. That achievement tests of various types may be pro-
jected forward as indices of aptitude within their respective
areas has already been emphasized (Chapter I, pp. 14-15).
However, academic prediction is by no means dependent upon
terminal examinations (whether in school or college) except in
so far as these, plus other differential measures of individual
promise, may be available at any given time. It is often con-
venient to combine them into a single index, or forecast,
usually through the method of multiple correlation. R, the
multiple correlation coefficient, is particularly important in ed-
ucational measurement and guidance, since it provides a means
of determining what efficiency can be expected from a teamed
combination of various scores or previous academic records,
each appropriately weighted in order to obtain maximum fore-
casting power from their complementary elements. This may be
roughly likened to the resultant of component forces in physics.

Thus if several elements are related to the criterion for which
individual predictions are desired, it is possible to determine the
relative best weights which should be accorded to each in order
to maximize their power as a team (i.e., their net, merged effec-
tiveness). That involves ascertaining first the relationship of
each member with all the others (as well, of course, as with the
criterion itself). A familiar example of this technique is the
combination of rank in high school class with scores on some sort
of intelligence or scholastic aptitude test, to yield an individual
prediction of freshman standing. This procedure has been em-
ployed for many years (May, 1923; Johnson, J. B., 1927;
Crawford, 1930). It may of course utilize other variables as
well, such as Regents’ or College Entrance Examination Board
grades, interest and personality ratings or even age. (If the
latter is included among more definitely educational measures,
it usually has a negative weight, because younger students do
somewhat better academically, in secondary school and college
alike, than their older classmates.) Again, as has been illus-
trated in detail by Hull (1928), multiple-correlation or regres-
sion-line techniques can be applied to combine scores on several
aptitude tests in order to obtain therefrom maximum forecast-
ing efficiency for individual prediction purposes.
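
For the familiar two-predictor case (say, rank in class and an aptitude score), the best weights and the resulting R follow from the three correlations involved. The sketch below uses the standard two-variable formulas for standardized weights. The function name and the correlations (.55 and .50 with the criterion, .40 between the predictors) are hypothetical and are not drawn from the studies cited.

```python
import math


def two_predictor_weights(r1c, r2c, r12):
    """Standardized weights and multiple R for two predictors of one criterion.

    r1c, r2c: correlation of each predictor with the criterion.
    r12: correlation between the two predictors.
    """
    denom = 1.0 - r12 ** 2
    beta1 = (r1c - r2c * r12) / denom
    beta2 = (r2c - r1c * r12) / denom
    multiple_r = math.sqrt(beta1 * r1c + beta2 * r2c)
    return beta1, beta2, multiple_r


# Hypothetical values, assumed for illustration.
beta1, beta2, multiple_r = two_predictor_weights(0.55, 0.50, 0.40)
print(f"weights: {beta1:.3f}, {beta2:.3f}   R = {multiple_r:.3f}")
```

With these assumed figures the team correlates about .63 with the criterion, somewhat above the .55 obtained from the better predictor alone.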

Formulas and methods for multivariable correlations are 
also presented, and the step-by-step procedures in such calcu- 
lations fully outlined, by the authorities on statistical methods 
already cited. It may be noted in passing that as many variables 
as one wishes may theoretically be employed in determination 
of their respective optimum weights and the consequent multiple 
coefficient. However, as their numbers grow, this process rapidly 
becomes more and more laborious. The corresponding gain in
predictive yield is likely, at the same time, to fall off as the fac- 
tors utilized increase, due to overlapping among them. Effi- 
ciency of their combined predictive powers is enhanced, propor- 
tionally, both by high correlation of the individual variables 
with the criteria, and low correlation among the indices them- 
selves. For this reason it is usually not economical to employ 
more than five variables as a team in multiple correlation; while
often two or three will so nearly approximate the results obtain- 
able from a larger number as to make the added effort of using 
others unjustified. The law of diminishing returns is apparent 
in this situation (Hull, 1928, p. 260). 
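
The diminishing return from added variables can be illustrated with a simplified case in which every predictor has the same validity and every pair of predictors overlaps to the same degree. Under that assumption (made here for illustration, not taken from Hull), R² = k·r²/(1 + (k − 1)·ρ) for k predictors. The function name and the values r = .50 and ρ = .50 are hypothetical.

```python
import math


def multiple_r_equicorrelated(validity, intercorrelation, k):
    """Multiple R for k predictors sharing one validity and one intercorrelation.

    Assumes the implied correlation matrix is a consistent one.
    """
    r_squared = k * validity ** 2 / (1.0 + (k - 1) * intercorrelation)
    return math.sqrt(r_squared)


values = [multiple_r_equicorrelated(0.50, 0.50, k) for k in range(1, 7)]
gains = [later - earlier for earlier, later in zip(values, values[1:])]
for k, value in enumerate(values, start=1):
    print(f"{k} predictor(s): R = {value:.3f}")
```

Each added variable raises R by less than the one before it. Lowering the assumed intercorrelation makes the later variables worth more, in keeping with the point about low correlation among the indices themselves.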


vector-analysis techni 
ics. To state that “4 
vectors upon certaj 
topie, except for those already famili 
(though perhaps not with their appli 
cal realm). 


[Those] interested in problems of factor analysis, its his-
tory and methods, will find an excellent review and critique in
the last chapter of Guilford’s Psychometric Methods (1936,
pp. 457ff.). This includes, for example, consideration of Spear-
man’s (1904, 1927) pioneer two-factor (or bifactor) theory,
his “tetrad-difference” method of analysis and other important
approaches to the question. Thurstone’s alternative multiple-
factor concept is set forth in his classical work, The Vectors of
Mind (1935), and other writings too numerous for mention
here.

A simpler, but for most educators or counselors entirely ade-
quate, exposition of this topic and of the PMA experiment it-
self (subsequently described in Chapter VI) will be found in
the same author’s Educational Record article, “A New Concept
of Intelligence” (1936). In Chapter XI of Anastasi’s book
previously cited, she gives a succinct account of various “mental
organization” theories. One of the best general references for
those seeking further knowledge of this important, though
rather complicated, topic is Thomson’s Factorial Analysis of
Human Ability (1939); another is Statistical Procedures and
Their Mathematical Bases by Peters and Van Voorhis (1940),
to which frequent references have already been made.


The Meaning of Factors 


Still, without attempting to discuss these complexities in any
comprehensive sense (either mathematically or otherwise), we
may yet point out that factor methods, of whatever sort, in
educational areas have two common characteristics. They all
attempt to estimate the amount of generality or overlapping
(concept and terminology varying with the theory) which exists
in mental performance, and they depend upon some type of
correlation among test scores for their basic data. Thurstone’s
“centroid” method represents one means of isolating, within a
complex multiple test situation, a series of indices or clusters
which are clearly related to each other. Also it yields the loading
or relative contribution which each factor theoretically con-
tributes to performance on a given task. In any battery, some
one underlying factor (call it what you will) may have a posi-
tive weight in respect to every subtest, while other factors may
[show] great variance with respect to different subdivisions of
[ability].
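
As a rough illustration of what a loading is, the first factor of the centroid method can be computed by dividing each column sum of the correlation matrix by the square root of the grand total. The sketch below is a deliberately reduced version under stated assumptions: it places unities in the diagonal instead of estimated communalities, extracts only the first factor, and uses a made-up three-test matrix. The function name is likewise hypothetical.

```python
import math


def first_centroid_loadings(corr):
    """First-factor loadings: column sums divided by the root of the grand total."""
    size = len(corr)
    column_sums = [sum(row[j] for row in corr) for j in range(size)]
    total = sum(column_sums)
    return [column_sum / math.sqrt(total) for column_sum in column_sums]


# Hypothetical intercorrelations among three subtests (assumed values).
corr = [
    [1.00, 0.60, 0.50],
    [0.60, 1.00, 0.40],
    [0.50, 0.40, 1.00],
]
loadings = first_centroid_loadings(corr)
for index, loading in enumerate(loadings, start=1):
    print(f"subtest {index}: loading {loading:.3f}")
```

All three subtests receive a positive weight on this first factor, with the subtest that correlates most highly with the others receiving the largest.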

The factors successively emerging through such analyses
may have little or no initial meaning; i.e., they are not sought
out in advance under descriptive terms (verbal, spatial, etc.)
but simply appear in the process. In order to distinguish among
them somehow, they can be called, e.g., I, II, III, IV, etc., with-
out immediate concern as to significance or applicability of the
findings. They must be derived as entities before they can be de-
fined. Whatever meaning (psychological or otherwise) the sev-
eral factors have is attached to them subsequently through sub-
jective judgment and inspection of what they retroactively seem
to possess.¹⁸

This is an oversimplified statement of the practice most com- 
monly employed in factoring mental tests: first to isolate rela- 
tively pure clusters through internal analysis of the data, and 
to label them afterwards as a matter of convenience. The statis- 
tical method does not dictate any special procedure of inter- 
pretation and is itself free from preconceptions regarding what 
results it may yield. Such techniques, of course, are employed in 
many types of investigation, mental or physical—for example, 
in experimenting among forces whose nature is more clearly un- 
derstood and defined than certain human characteristics are. 

Whatever hypothetical attributes (memory, induction, num- 
ber sense, suggestibility, dominance, introversion, etc.) are thus 
factored out, result from study of the particular test materials 
having the highest loading for each of the traits isolated in turn; 
and what one chooses to call them. Thus subjectivity, or pre- 
dilection of the investigator, enters at the crucial, ultimate 
stage. However refined the various methods of factor analysis 
may be, they are alike erected on a sometimes shaky base of 
initial correlation coefficients; and the terms eventually used 
to label each factor for practical purposes have little objective 
determinance. Hence one should realize that, for all its flavor of
precision while in the making, factorial analysis of a test bat-
tery can be no more reliable than are the data from which it
stems; also that the name and nature of any factor isolated rep-
resents, after all, little more than a shrewd guess by the analyst
as to what its true entity or psychological significance may be.¹⁹


F. REVIEW AND INTERPRETIVE PRECAUTIONS 


We have attempted to outline, however sketchily, major as-
pects of statistical analysis as commonly applied to mental
measurements. The devices most frequently employed in report-
ing test data include, for example, measures of: central tend-
ency (averages or medians); spread or dispersion; correlation
(reliability and validity); indices of accuracy (errors of esti-
mate or confidence levels); predictive power as enhanced by
multiple correlation methods (teamed composites); factorial
analysis, etc. It is important to realize that none of these tech-
niques can yield results of universal, inherent significance.

18. Cf. the discussion in Chapter VI of Thurstone’s research on Primary
Mental Abilities and the procedure for naming the factors after they have been
successively extracted.

19. [Since the] present chapter is intended primarily for readers with little back-
ground in statistical methods, duplication occurs between the foregoing and
later remarks (in Chapter VI) on “factor” determination. The major points
are perhaps sufficiently important to warrant repetition.

The means or sigmas reported for test scores, however ac-
curately determined, merely reflect tendencies within certain
groups. Other groups may or may not resemble the latter. 
That appears fairly obvious with respect alike to average and 
range of performance. For example, so-called national norms 
on the American Council Psychological Examination (repre- 
senting as they do a conglomerate population from many differ- 
ent colleges) would prove of questionable value in educational 
guidance, for example, of Harvard freshmen. The latter are so
highly selected that thorough discrimination among them can- 
not be accomplished merely by reference to such polyglot data. 
Substantial evidence suggests that the upper half of Harvard 
freshmen would rank within the top five or ten per cent of all 
college students taking this examination; hence norms so in- 
appropriate for that University would not serve to identify 
superior matriculants there.

While perhaps no educator would seriously expect averages
or percentile ranks to have a fixed meaning, irrespective of the
group levels within which they were attained, just that sort of
fixed meaning is often attributed, by implication at least, to
correlation coefficients. Statements are frequently published to
the effect that “validity of this test is .60 and its reliability
.90,” or “the measure correlates .45 with mechanical ability.”
In such instances, “is” and “correlates” suggest an absolute
condition which simply does not exist in a permanent or uni-
versal sense. What was determined for a certain group in a
certain situation will not necessarily hold for some other group
under different conditions. Even the reliability, or factorial
weights, of a test (though both determined by internal and
therefore presumably rather stable evidence) may vary to a
significant degree for different populations or under varying
administrative conditions.

Validity is still more ephemeral. As emphasized before, the
validity of a measure can be much affected by influences quite
extraneous to itself and therefore has no inherent or necessarily
true and certain nature. Yet the temptation to speak as if it had
(in discussing test evidence, for example, without specifying
other data such as the level and range of ability represented) is 
great; partially because the recital of all evidence pertinent to 
appraisal of a relationship may involve distracting or laborious 
digressions. Hence it is frequently omitted or slurred over. The 
writer pleads guilty to such inadvertence, both in previous arti- 
cles and in some sections of this study. However, for many of the 
correlations to be cited therein, no comparable data as to ability 
or range of the groups represented are available for more ade- 
quate interpretation of validities than mere coefficients offer. 
When these are not supported or defined with due reference to 
level or dispersion, the foregoing reservations should be kept in 
mind. 


Validity and Reliability Not Absolutes 


To repeat, most correlation coefficients (and especially those
dealing with validity) have a relative, not an absolute, signifi-
cance. They are especially susceptible to differences in range of
ability. Thus a certain test (e.g., of verbal comprehension)
might yield successive r’s of .80 with appropriate criteria in the
tenth or eleventh grade, .50 in college, and only .30 in law
school. The same instrument, even if it possessed equal inherent
validity for these stages, would show progressively less statisti-
cal validity at successive higher academic levels, merely because
of increasing homogeneity (decreasing spread) among the sev-
eral populations, in terms of what it measures.
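
The shrinkage of a coefficient under selection can be demonstrated by simulation. The sketch below is illustrative only: it assumes normally distributed scores, an underlying correlation of .80, and selection strictly on the test score itself, none of which is claimed for any actual school population. All names and values are hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Product-moment correlation computed from raw scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_selection(rho=0.80, n=20000, seed=1):
    """Correlate test and criterion in the whole group and in selected groups."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        test = rng.gauss(0.0, 1.0)
        criterion = rho * test + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        pairs.append((test, criterion))
    pairs.sort(key=lambda pair: pair[0], reverse=True)  # highest test scores first
    results = {}
    for label, fraction in (("all", 1.0), ("top half", 0.5), ("top tenth", 0.1)):
        kept = pairs[: int(n * fraction)]
        results[label] = pearson_r([p[0] for p in kept], [p[1] for p in kept])
    return results


results = simulate_selection()
for label, value in results.items():
    print(f"{label:>9}: r = {value:.2f}")
```

The test is the same instrument throughout; only the spread of the group changes, and the obtained coefficient falls as selection becomes more severe.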

As selection by whatever means is enhanced, subsequent dis- 
crimination becomes more difficult. The true magnitude of any 
correlation coefficient is, in mental testing, hard to determine 
(even assuming highly reliable criteria, such as are all too 
scarce) because of possible wide differences in the composition 
of groups tested. Therefore it cannot be too strongly empha- 
sized that reported correlations, such as those which will be 
given for many instruments throughout subsequent chapters,
and countless others to be found in educational literature of the
past decade, do not represent necessarily stable or intrinsic
values. They afford an index of how much dependence can be
placed upon guidance measures under circumstances like those
wherein the original correlations with appropriate criteria were
derived. Just because some test acted in a certain way (yielded
a high r) in X situation, it should not be assumed that it will
operate likewise in situation Y or Z. This stricture applies to
reliability as well as to validity determinations.

To be meaningful, reliability and validity should, whenever
possible, be determined locally and anew (even within successive
classes at the same institution) rather than by mere inference
from previously obtained data. Unless clear evidence exists
that such inference is actually justified by a known compara-
bility²⁰ among testing conditions, range of talent, criteria,
etc. from one situation to another, it cannot be taken for
granted. These are hard terms which often cannot be met be- 
cause of inescapable restrictions upon testing time or costs; 
but it is well to strive for them as ideals and to maintain a criti- 
cal, interpretive attitude, especially when individual data are 
being considered. Those which are presented in succeeding 
chapters are, of course, all subject to that stricture. Since it 
would often be inappropriate, if not impossible, for the writer
to evaluate these personally, he issues herewith a general caveat 
along the lines above to all readers. 


One will find in succeeding chapters a repetition or expan-
sion of certain statements already made in connection with dis-
cussion of various formulas in the present chapter. In many
cases such repetition is unavoidable because of the necessity for
stressing certain points particularly appropriate in connec-
tion with the immediate data under consideration.

20. Comparability does not mean strict similarity; reasonable inference may
often be drawn, even between groups of different composition, if their respec-
tive means and the range of deviations therefrom are known and properly
taken into account. Cf. Crawford and Burnham (1944).


CHAPTER III 


DIFFERENTIAL APTITUDE AND GENERAL 
INTELLIGENCE TESTS 


The term “educational aptitude,” as defined in Chapter
I for the purposes of this volume, designates an index
of relative ability to acquire the knowledge and per-
form the kind of thinking demanded by some distinctive branch
of collegiate study. Overlapping to be found among educational
and vocational, aptitude and achievement, or “general intelli-
gence” test devices has already been mentioned and will be
discussed further in this and subsequent chapters. Such over-
lapping is genuine, and the resultant confusion sometimes
found between these terms quite understandable. Thus an
achievement test of whatever form, if at all valid, should serve
to indicate individuals’ relative promise (or, in that sense, apti-
tude) for more advanced study in the same or a cognate field.
Properly developed essay examinations, as well as Cooperative
Tests or other objective-type instruments, can serve not only
the function of evaluating past acquirements but also that of
predicting future progress. There indeed lies their real mission,
though it is not always so recognized.

Tests primarily directed toward measuring educational
achievements are limited by what the individual has already
had a chance to achieve. Therefore, as we have previously said
and shall reiterate with more specific illustrations, these serve
as valid indices of his readiness-to-learn only in fields to which
he has earlier been adequately introduced. A basic assumption
proposed in Chapter I is that variant curricula represented in
our collegiate institutions and professional schools call for dif-
ferential modes of thinking. Yet our educational system, par-
ticularly in its more classical forms, offers no guaranty of a
student’s comprehensive exposure to these disparate mental
processes. For many individuals this is no loss, since they do not
possess differences of significant degree within themselves of
aptitude for one major field rather than for another. Some ap-
pear equally superior in all; more are average throughout;
others are uniformly dull, at least in any academic sense. Our
second assumption is that a substantial proportion (at a guess,
about one-third) of our youth do possess distinctly greater ed-
ucability in one particular direction than in others. Wherever
such internal variations in learning capacities exist, it is highly
important that they be recognized. It is the aim of aptitude
tests to reveal those possibilities where the usual curriculum
(and, therefore, achievement measures largely dependent upon
the latter’s content) has not already discovered them.

High authority for this point of view is represented by 
James B. Conant's report for the year 1936-1937 as President
of Harvard University: “A liberal education is possible, it
seems to me, only in an atmosphere of tolerance engendered by 
the presence of many men of many minds. The future architect 
and poet are quite as important as the future lawyer, the future 
engineer or research scientist as essential as the future business 
man. To assure a well-balanced community, therefore, the
criteria of admission to Harvard College must not unduly
stress those qualities which are of primary importance to the
able lawyer and less essential to a novelist or a musician. Above 
all, the fatal error must be avoided of excluding a promising 
man of unilateral power. It is one thing to require a student to 
strengthen those intellectual muscles that are flabby from lack 
of exercise or congenital weakness; it is quite another to elimi-
nate such an unevenly developed individual from further com-
petition.” (Conant, 1938, pp. 10-11.)


GENERAL VERSUS DIFFERENTIAL PROGNOSIS 


The objective of educational aptitude tests may be further 
clarified by indicating how they differ from general intelli- 
gence measures. These, like aptitude tests, are forward-looking 
and designed to predict future academic performance. No one 
can deny that they have served a valuable purpose in facili- 
tating the selection and guidance of students, particularly at 
the secondary-school and college-entrance levels. The best of 
them yield a reliable index of mental alertness, or broad edu- 
cability in the academic sense. It should be noted, however, that 
such tests are largely nondifferential in nature. Whichever of 
the familiar labels they bear—IQ, General Intelligence, Scho- 
lastic Aptitude, Mental Rating, College Index—all are based 
on what might be called a “quantum theory” of educability. Ac-
cording to that concept, every individual may be said to pos-
sess, in greater or lesser extent, certain quanta of generalized
learning power or scholastic efficiency. A measure of these
quanta is thus regarded as equally significant for the predic-
tion of achievement in any field of intellectual endeavor. This
philosophy does not allow for the fact that quite different types
of thinking ability (relative quanta) are required for the suc- 
cessful pursuit of studies in such varying fields as, for example, 
literature, economics, organic chemistry, thermodynamics and 
architecture. 


Brigham on Binet 


The standardized, undiscriminating use of general intelli- 
gence tests and the assumptions underlying this procedure 
were trenchantly attacked by the late Carl C. Brigham in A 
Study of Error. For example, in discussing Binet’s work and 
its effect upon subsequent testing procedures, he writes: “No 
one has ever demonstrated that it is legitimate to add such ap- 
parently heterogeneous performances as detecting absurdities, 
arranging a set of five weights differing by three grams, and 
copying a design from memory after a ten-second exposure; 
yet all scales were based on the assumption that this very ques- 
tionable procedure was legitimate. . . . The standardized scale 
draws attention to the total score on the scale. This total score 
differently reached by different individuals, after transmutation 
into a mental age scale and division by chronological age, be- 
comes an end in itself. And one may not tinker with a stand- 
ardized scale. Further experimentation is blocked by the act of 
standardization." (Brigham, 1932, pp. 23-24.) 

At another point he says, in similar vein: “General intelli- 
gence seems merely something hypostatized to explain test 
scores. The conventional practice of a tester of adding all of his 
scores into a single total score made it necessary to hypostatize
a general intelligence and not specific intelligences.” (Idem, p. [illegible].)

It would be improper to suggest that the passages just 
quoted from Brigham either represent his entire attitude to- 
ward general intelligence measures, or otherwise take full ac- 
count of their prognostic value. Nevertheless, in pointing out 
the nondifferential character of many widely used intelligence
and scholastic aptitude tests of the present day, these statements
are distinctly significant. Mention of “specific intelligences”
may reasonably be associated with just the sort of variance 
which educational aptitude tests are intended to measure. 
Thurstone later expressed much the same view:

“For many years psychologists have been accustomed to the
problems of special abilities and disabilities. These are, in fact, 
the principal concern of school psychologists who deal with
children who cannot read, with children who have a blind spot
for numbers, or with children who do one thing remarkably 
well and other things poorly. It seems strange that, with all this 
experience in differential psychology, we have clung so long to 
the practice of summarizing a child's mental endowment by a 
single index, such as the mental age, the intelligence quotient, 
the percentile rank in general intelligence, and other single 
average measures. An average index of mental endowment
should be useful for many educational purposes, but it should
not be regarded as more than the average of several tests. The
error that is frequently made is that the intelligence quotient is 
sanctified by the assumption that it measures some basic func- 
tional unity, when it is known to be nothing more than a com- 
posite of many functional unities.” (Thurstone, 1941a, p. 8.) 

Brief discussion of the IQ concept (to which both the writers 
just cited refer) and its possible significance for educational 
guidance will follow in due course. Later, specific attention will 
be given to certain general intelligence tests of the sort now 
widely employed as prognostic measures at high school and 
college freshman levels. These are in a sense more specialized 
than the original IQ tests were, but less so than differential aptitude
or primary ability indices of still more recent development.


GENERAL INTELLIGENCE


THE VERBAL FACTOR


Such success as general intelligence measures have achieved
in predicting average scholastic proficiency has been due in
part to their characteristic validation against academic per-
formance at college-preparatory and freshman levels. At most
institutions the course of study for those years, though intro-
ductory to later diversified curricula, has long been largely
verbal in content—emphasizing in secondary school English, 
history and languages at the expense of science or even mathe- 
matics in a ratio of about 3 to 1. Certainly, and perhaps logi- 
cally, earlier intellectual demands are much less varied than are 
those made by the subsequent areas of concentration to which 
that formative period of education leads. For better or worse 
these academic traditions, like the “old school tie” and all it im-
plies, have been deeply affected through the impact of war. The
comments which follow in this and subsequent chapters are nec-
essarily based in great part upon earlier data.

As a consequence of standardization in subject matter at pre-
paratory levels and resultant emphasis upon analogous criteria,
the commonly used tests of scholastic aptitude have themselves 
become more and more heavily saturated with verbal material. 
Even though general intelligence tests are not measures of ver- 
bal aptitude alone (since they contain a considerable admixture 
of nonverbal elements), the latter are usually not present in 
sufficient quantity to provide adequate differential measure- 
ment. Such instruments, though heterogeneous in their make- 
up, are not specifically discriminating. The single merged 
score of academic promise which they yield is a hodgepodge, 
largely affected (to an indeterminate degree) by verbal facility, 
but seldom appraising even that mental function with precision. 
Other elements just mentioned tend to obscure this in a total 
score, without offering much specific value of their own. 


THE INTELLIGENCE QUOTIENT 


Our primary objective is the evaluation of differential apti- 
tude tests. Yet, for purposes of distinction, these need to be 
considered within the total psychometric setting and in rela- 
tionship to other measures and concepts. The IQ, which has al- 
ready been mentioned in passing but not yet defined, is some- 
times quite improperly employed as a generic term for almost 
any objective-type examination. 

Space does not permit adequate discussion of this concept,
originating from Binet’s pioneer studies of intelligence among 
feeble-minded subjects and children. Perhaps the best histori- 
cal account of Binet’s extensive work and developments result- 
ing therefrom, together with a wealth of bibliographical mate- 
rial, will be found in Peterson’s (1925) Early Conceptions and 
Tests of Intelligence. Pintner’s meaty little book, Intelligence
Testing¹ (1923), published two years earlier, contains an
excellent review of this and related topics. A study by Quinn
McNemar (1942) deals with later revisions of the Stanford-
Binet Scale.


What Is the IQ? 


Discussion of the IQ, strictly speaking, may seem out of 
place in this volume, first because the term implies general
rather than specialized educability, and second because it is
more appropriate to earlier educational levels than to those
with which we are chiefly concerned. Understanding of its na-
ture is, however, essential to later consideration of more spe-
cialized abilities.

1. Cf. especially Chapter iv, “The Concept of General Intelligence,” and
theoretical growth curves of “superior, average and inferior” persons in this
respect (p. 68).

The curve of increment on simple “mental alertness” tests 
such as Binet devised and others have improved rises sharply 
for normal children up to about 12 years of age and then be-
gins to level off; somewhere between 14 and 16 years it reaches
a plateau. This of course does not imply cessation or retarda- 
tion of the individual's learning at adolescence; it simply means 
that certain neural and other processes essential for the acquire- 
ment of advanced knowledge and training in occupational skills 
have by that time matured. One becomes adult in basic mental 
equipment before physical growth is complete; but ultimate 
intellectual stature is determined by the consequent use made 
of that equipment. 

The main point here is that tests of the Binet type and the
entire concept of an intelligence quotient properly relate to
early, formative years. “Quotient” of course denotes a ratio.
In this case $\mathrm{IQ} = \dfrac{\text{mental age}}{\text{chronological age}} \times 100$, with 100 (equivalence be-
tween the two) representing average intelligence; and e.g., 80
dull, 120 superior, 160 exceptionally brilliant minds. The IQ
designates brightness, rather than eventual scope of knowledge;
capacity for intellectual progress, not final attainment in either
breadth or altitude. Psychologists differ to some extent in their
opinions as to when progressive development of sheer bright-
ness terminates; it probably varies somewhat among individu-
als, as does the age at which their maximum physical height is
later attained.

There has been a great deal of argument, often rather acri-
monious, regarding “constancy of the IQ,” that is, whether
the ratio between mental and chronological age remains ap-
proximately constant throughout childhood and adolescent
years. We shall not even attempt to discuss that moot point.
Assuming, however, that neural development (as measured by
simple, unspecialized tests) parallels physical growth up to 14
or 16 years but no further (except in the enrichment of knowl-
edge, which is a different matter), it is clear that the concept
of an intelligence quotient, or ratio between successive mental
and chronological ages, loses significance after the numerator
has become stabilized. This situation can be met in part through
also stabilizing the denominator, as is commonly done by divid-
ing the mental age score attained on IQ tests by not more than
16 years on the chronological scale; thus average attainment on
tests of this nature by a man of 32 still gives him an IQ of 100 
rather than 50. Nevertheless, many authorities have challenged 
legitimacy for measurement purposes of the IQ concept or in- 
dex, even at senior high-school or college freshman levels, how- 
ever useful it may be at earlier stages. This criticism is based 
upon a belief that the idea of varying quotients retains com- 
paratively little meaning after both their components are, so 
to speak, frozen. 
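
The arithmetic of the stabilized denominator may be sketched as follows. This is only an illustration of the ratio described above; the function name `ratio_iq` and the parameter `adult_ceiling` are hypothetical labels introduced here, and the ceiling of 16 years is the figure mentioned in the text rather than a universal standard.

```python
def ratio_iq(mental_age, chronological_age, adult_ceiling=16.0):
    """Ratio IQ: 100 * mental age / chronological age.

    The chronological age used as divisor is capped at `adult_ceiling`
    (assumed here to be 16 years, as in the text), so that adults are
    not penalized merely for growing older.
    """
    if mental_age <= 0 or chronological_age <= 0:
        raise ValueError("ages must be positive")
    divisor = min(chronological_age, adult_ceiling)
    return 100.0 * mental_age / divisor


# A child of 10 performing at the 12-year level.
print(ratio_iq(12, 10))
# A man of 32 with average adult attainment (mental age 16), capped divisor.
print(ratio_iq(16, 32))
# The same man with no cap on the divisor.
print(ratio_iq(16, 32, adult_ceiling=float("inf")))
```

With the cap in place the man of 32 receives 100; without it the same performance would yield 50, which is the anomaly the capped divisor is meant to remove.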

Many students, teachers or counselors, and even some test con-
structors, use the term IQ quite indiscriminately. The writer
has heard it applied to such varied instruments, for example, 
as the College Board’s Scholastic Aptitude Test, the American 
Council Psychological Examination, the Yale Aptitude Bat- 
tery, a series of Cooperative Tests and the Graduate Record 
Examination—none of which involves an accomplishment/age 
ratio, which is what the Intelligence Quotient properly repre- 
sents. 


Which IQ Is Whose? 


This question may well be asked whenever an IQ is cited by 
schools, parents, or on occasion even by the candidate himself. 
Because this index is often so carelessly applied, it becomes all 
the more necessary to ascertain what sort of test or battery has 
been utilized for its appraisal. Some measures scored in IQ 
terms (like the well-known Otis Test, for example) are in fact 
general intelligence tests, largely saturated with verbal ele- 
ments; others retain the original form of appraising perform- 
ance on tasks of a less specialized nature. Consequently one so- 
called IQ determination may differ considerably from another, 
both in nature and in results. 

Traxler has published an interesting comparison of IQ scores 
for the same individuals as obtained from different tests and at 
different times. The IQ’s made on the new edition of the Kuhl-
mann-Anderson tests by 421 elementary-school pupils were
compared with scores made on the Binet tests by the same stu-
dents.

Traxler’s findings with respect to correlation between the 
two tests are stated as follows: “The correlations between the 
Binet IQ's of the independent-school pupils and their IQ’s on
the fifth edition of the Kuhlmann-Anderson test are about .60
to .65. Although these correlations are not very high for two 
tests of intelligence, they are a little higher than those between
the Binet IQ's and the IQ's based on the fourth edition.” (Trax-
ler, 1941a, p. 32. Cf. also Terman, 1916; Kuhlmann and An-
derson, 1939.)

No criticism is here implied of either measure as such, or in- 
deed of others in the same category. The present comments are 
intended chiefly to demonstrate variability of the sacrosanct IQ 
as obtained by different means. The long-fought controversy 
among psychologists regarding its constancy when successively 
measured by the same method (e.g., Stanford-Binet or analo- 
gous tests) from early childhood onward represents quite a 
separate problem. Difficult as that alone is of solution, it obvi- 
ously becomes even more complicated when IQ determinations, 
for the same pupils by different means, themselves vary to the 
extent indicated by Traxler’s investigation.

Our purpose in emphasizing this point should be clear. So 
much has been written about the IQ and its alleged stability 
that many persons regard the designation as having some fixed 
meaning. Yet if two or more tests yield IQ scores which corre-
late with each other only little better than .60, it becomes evi-
dent that the terms “intelligence quotient” or “mental age” are
much too loosely used in ordinary parlance and popular con- 
ception. To repeat, the meaning of an IQ for individual guid- 
ance purposes depends in no small degree upon the particular 
instrument employed for its appraisal.
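
What a correlation of about .60 between two IQ determinations implies for the individual case may be sketched numerically. The sketch below uses two standard relations: the proportion of variance shared by two measures is the square of their correlation, and the standard error of estimate is the standard deviation multiplied by the square root of one minus that square. The standard deviation of 16 IQ points is an assumption introduced purely for illustration; it is not reported in the passage.

```python
import math


def shared_variance(r):
    """Proportion of variance two measures have in common (r squared)."""
    return r * r


def standard_error_of_estimate(r, sd):
    """Typical error in predicting one score from the other, in score units."""
    return sd * math.sqrt(1.0 - r * r)


ASSUMED_SD = 16.0  # hypothetical IQ standard deviation, for illustration only

for r in (0.60, 0.65):
    print(
        f"r = {r:.2f}: shared variance = {shared_variance(r):.2f}, "
        f"standard error of estimate = "
        f"{standard_error_of_estimate(r, ASSUMED_SD):.1f} IQ points"
    )
```

Under these assumptions, two tests correlating .60 share only about a third of their variance, so an IQ obtained on one is a rather loose guide to the IQ the same pupil would obtain on the other.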

Traxler states further: “The correlations are substantial but
they are not very high for two tests designed to measure the
same thing—general intelligence. Among the factors tending 
to cause the correlations to be lower than they might otherwise 
have been are the grouping of the pupils together regardless of 
grade level, the large amount of time which elapsed in some 
cases between the administration of the Binet tests and the 
Kuhlmann—Anderson tests, and the fact that the Binet tests 
were given and scored by many different examiners.” (Traxler, 
1941a, pp. 30-31.) 

The coefficients reported by Traxler between successive IQ 
ratings for the same pupils by different tests (over an un- 
specified but evidently not long interval within the primary 
grades) seem quite low. The writer firmly believes that (at least 
for secondary and college levels) IQ and its concomitant MA 
(Mental Age) are dangerously susceptible to misinterpreta- 
tion and could well be now retired emeriti from academic circles. 
In their place we should have, ideally, specific reports of rela-
tive standing for each student on various differential tests,
themselves adequately described in terms of both level and
range. 


THE ARMY ALPHA 


The Army Alpha Test undoubtedly merited the great 
amount of attention which it was accorded immediately follow- 
ing World War I. Anyone interested in the history and use of 
the Alpha examination (for literates) and Beta (for illiterates) 
is referred to Yerkes’ monumental work (1921). The Alpha 
test contained a large amount of verbal material and usually 
has been regarded as a measure of general intelligence. In the 
broad range of ability found in the Army, it performed a valu- 
able function in differentiating rapid learners from slower ones. 
This, of course, is important to any classification scheme for a 
program wherein large numbers of men and women hurriedly 
receive specialized training. 

It was in some ways unfortunate that the Army Alpha re- 
ceived an overenthusiastic reception in education and in per- 
sonnel work with adults immediately following the war. There 
were many disappointments connected with its subsequent use 
in civilian life, but much was learned thereby. For instance, the 
Army Alpha proved to be of little value in differentiating 
among college freshmen, for it had not been designed for this 
purpose. It is interesting to recall that the median years of 
schooling reported by the native-born white draft was 6.9
grades; that of the southern Negro draft 2.6 grades. (Yerkes,
1921, p. 761.) In the light of this, it does not seem surprising 
that a test designed for differentiation at this lower level should 
not prove discriminating in college groups with 12 grades of 
schooling.

Although somewhat outmoded in its original form, the Army 
Alpha can by no means be considered obsolescent. Following 
much experimental work, at least two rather well-known re- 
visions have been published. Atwell and Wells (1933) reported 
that a shortened form (16 instead of 35 minutes) had been 
developed which correlated from .74 to .83 with the original 
form and had a reliability varying from .72 (retest coefficient) 
to .80 (Spearman-Brown coefficient). 

Guilford (1938) produced a revision based on experience
gained in using this test at the University of Nebraska. His
scoring scale yields separate scores on three factors, viz: V (ver-
bal); N (numerical); and R (relations, i.e., the ability to rec-
ognize “simple logical relationships”). Norms were established
for a student population at the University of Nebraska, but
no data seem to have been published on the intercorrelations of 
factors, internal reliability or validation with respect to grades 
or other external criteria. 


Testing in World War II


The Army General Classification Test of World War II
(U. S. War Department, 1942) and associated special meas-
ures, including those developed by the Navy, illustrate the mod-
ern trend toward differential indices. Data available on these
measures (Psychology and the War, 1942 ff.) describe their 
effectiveness in service training situations. 

Sufficient time has not elapsed for publication of studies re- 
garding their use in other circumstances. Indeed the principal 
contributions to education in general, made by the Army and 
Navy testing programs, probably reside in the methodology 
and techniques developed rather than in the tests themselves. 
These were designed for maximum operating effectiveness in 
a service situation and cannot reasonably be expected to per-
form so well in a different setting. However, certain of the spe-
cialized “aptitude” and “trade knowledge” measures described
in the War Department's technical manual cited above might
prove almost equally useful in selection for industrial jobs.
The same or analogous tests would, of course, require standard-
ization upon new norms drawn from the respective industrial
populations.


THE CAVD


Among intelligence tests published in the decade 1920-1929,
Thorndike’s CAVD (Completion, Arithmetic, Vocabulary, Di-
rections) attracted much attention. Its unique features warrant
specific consideration here. The object of Thorndike’s research
circa 1922-1925 was to develop an instrument which would
measure the “altitude” of intellect over the entire sweep from
virtual imbecility to genius. A series of paper and pencil exer-
cises, or tasks, was developed and then staggered over 17 levels
of performance, A to Q inclusive, with some overlapping be-
tween any one and adjacent groupings. The difficulty of each
item at a particular level was determined by a “consensus of ex-
pert psychologists.” Level A is described as of such simplicity
that half of its items were answered correctly by 88% of adult
imbeciles whose mental ages ranged from 2½ to 5 years. The
highest level, Q, is composed of items so difficult that only 23%
of college graduates answered half of them correctly. 

The following statement from The Measurement of Intelli-
gence² describes the nature of this test:

“The total series of tasks concerns four lines of ability:

“C: To supply words so as to make a statement true and
sensible.

“A: To solve arithmetical problems.

“V: To understand single words.

“D: To understand connected discourse as in oral directions 
or paragraph reading. 

“The arrangement of scoring is such as to attach equal 
weight to each of these four varieties of tasks." (Idem, p. 65.) 

Each level consists of 40 items divided equally among these 
four categories. The test is thus heavily weighted on the verbal 
side, since only one-fourth of its content is quantitative. The 
range of scores at any one level seems to be quite restricted, es- 
pecially when compared with the College Board's wide Scholas- 
tic Aptitude Test scale. Also, rigidity of the scoring key is such, 
in connection with certain mathematical problems, as possibly 
to affect both the reliability and the validity of this test. The
following hypothetical item has been developed to serve illus-
trative purposes in this connection:


Given [equation relating a, b and c; illegible in source]

When a = 2; c = 17.2; what is the value of b?
Correct answers: 20.6; 20 6/10 or 20 3/5
Not acceptable: 41.2/2


Obviously the person who recorded as his answer 41.2/2 has not
solved the problem quite so completely as is indicated by one
of the acceptably correct answers. But for some purposes this 
may represent at least a usable if not an entirely adequate 
solution. The point at issue here is the familiar one of whether 
all credit should be withheld or full or partial credit given. 
There are a sufficient number of such items in the CAVD con-
ceivably to account for as much cumulative variation as .5
standard deviation in range of total scores among graduate
students. So long as the scorer follows the rigid key of accept-
able answers there is no subjectivity in scoring in the usual
sense of the term. Yet there has certainly been a high degree of
subjectivity in development of the scoring key. This is ex-
tremely important in a test whose range of scores is so re-
stricted.

2. Thorndike, E. L., et al. (circa 1927, not dated) The Measurement of Intel-
ligence, New York, Columbia University Bureau of Publications.

The range reported for 453 graduate students at Columbia 
University was 388 to 447, the median being about 417. (Idem, 
p. 871.) This may be contrasted with the SAT³ range of 300
to 800 (approximately) among Eastern college freshmen. Like- 
wise among entrants to one department of graduate study at 
Yale, the range over several years has been from 416 to 450 
on the CAVD and from 400 to 800 on the Carnegie Founda-
tion’s GRE³ Verbal Factor Test.

Thorndike reports the interquartile range as varying from a 
maximum of 19 score points for ninth-grade subjects to a mini-
mum of 8 points for Ph.D. candidates at Columbia (Ibid.).
Except for the information just presented, little seems to have 
been published to facilitate judging the significance of particu-
lar CAVD scores. For example, a difference between 425 and
435 is much greater than one not familiar with this highly re-
stricted scale might think; but just what its significance is can-
not be readily determined in the absence of published norms
and estimates of reliability. In other words, the CAVD scale is
not self-descriptive like that of the College Entrance Examina-
tion Board. The latter is defined with respect to a well-known
basic population having a mean of 500, standard deviation 100
and reliability such that the probable error of measurement is
approximately 15.
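
What makes such a scale “self-descriptive” may be shown by a short computation. The sketch below assumes the usual normal-curve relation, probable error = 0.6745 × standard error, and the usual definition of reliability as one minus the ratio of error variance to total variance; the function names are hypothetical and the result is an approximation implied by the figures quoted, not a coefficient reported by the Board.

```python
PE_TO_SE = 0.6745  # probable error = 0.6745 * standard error (normal curve)


def scaled_score(z, mean=500.0, sd=100.0):
    """Convert a standard (z) score to a scale with mean 500 and SD 100."""
    return mean + sd * z


def reliability_from_probable_error(probable_error, sd=100.0):
    """Reliability implied by a probable error of measurement on a given SD."""
    standard_error = probable_error / PE_TO_SE
    return 1.0 - (standard_error / sd) ** 2


# A candidate one and a half standard deviations above the basic population.
print(scaled_score(1.5))
# Reliability implied by a probable error of about 15 scale points.
print(round(reliability_from_probable_error(15.0), 3))
```

On these assumptions a probable error of 15 points corresponds to a standard error of roughly 22 points and a reliability in the neighborhood of .95, so that any score can be read at once in terms of both relative standing and precision.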

One characteristic of the CAVD especially worthy of men- 
tion is its freedom from speed effects. No time limits are used, 
since the intention is to measure power rather than speed. 
Thorndike reports an experimental correlation of .40 between
speed (time in taking the test) and altitude (difficulty) for
sixty-three university students. (Idem, p. 401.) Another strik-
ing feature of the CAVD project is the paucity, despite its
long use, of validation data thereon with respect to scholastic
prediction. The test is apparently being employed by other
institutions than Columbia University; yet for some unknown
reason little or nothing has been published about its efficacy.
Hence it is manifestly impossible to appraise diagnostic value
of the CAVD in either a general or differential sense.

3. SAT is a familiar abbreviation for the College Entrance Examination
Board's Scholastic Aptitude Test and GRE for the Carnegie Foundation's
Graduate Record Examination. Both are later discussed in some detail.

While designed to measure the whole range of intelligence, 
the CAVD test is by no means the only instrument which has 
this same objective. In 1939, Wechsler published the first edi- 
tion of The Measurement of Adult Intelligence, which described 
the Wechsler Mental Ability Scale. This is less academic in na- 
ture than many of the intelligence tests described in this chap- 
ter and covers a wide range of intellectual abilities, motor as 
well as verbal. It was standardized on a broad sample, with par- 
ticular emphasis on reliable measurement at the lower levels of 
ability, thus making it particularly useful in clinical work. 
Goldfarb (1944) has investigated its discrimination among su-
perior adolescents and characterized the Wechsler Mental
Ability Scale as “relatively ineffective” in this respect. Altus
(1945) conversely reported it as very useful in predicting
graduation and discharge for trainees in an Army Special
Training Center.


INTELLIGENCE TESTS AND SCHOLASTIC 
ACHIEVEMENT 

Apart from the CAVD, a great mass of data exists as to the 
relationship of academic prognosis to various IQ, general intel-
ligence or scholastic aptitude tests.⁴ If, as we have suggested,
these yield but hodgepodge indices of mental capacity, then it
is not surprising that their maximum effectiveness has gener-
ally been found in respect to some equally conglomerate cri-
terion, i.e., average grades, rank-in-class, honor-point ratio or
other composite performance standards. In effect, most corre-
lation studies of this familiar type consist of matching largely
verbal test scores diluted with some numerical or spatial ele-
ments against largely verbal grade averages, likewise diluted
with a little mathematics or science.

4. For an excellent review of intelligence testing generally and pertinent
comments on factorial analysis thereof, see Cattell (1943).

More realistic appraisal of academic forecasting efficiency
should follow when separate prognostic indices are correlated
with appropriately separate criteria. This possibility, however,
requires a considerable lengthening of the time devoted to in-
dividual testing. Part-score combinations thus far assembled
from even the best-known generalized instruments (e.g., the
American Council Psychological Examination discussed later)
have proved rather unsatisfactory for differential guidance
purposes, because the individual sections (subtests) comprising
them are usually too short for reliable, directive use. If, as seems
reasonably well established, 45 to 60 minutes are required for 
the dependable evaluation of composite average promise, one 
can hardly expect each of several more specialized talents to 
be reliably appraised in half that time or less. While certain 
ad hoc skills (e.g., spelling, manual dexterity, arithmetic, cleri- 
cal work, etc.) can be quickly measured, it is a fallacy to assume 
that subdivisions of the intellectual learning process may simi- 
larly be foreshortened. Hence five or six hours of testing time 
are needed for thorough “all-over” evaluation. 
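
The loss of dependability in short subtests can be illustrated with the Spearman-Brown formula, which estimates the reliability of a test lengthened or shortened by a given factor. The sketch assumes the subtest consists of items comparable to those of the full test, and the starting reliability of .90 for a 60-minute test is a hypothetical figure chosen only for illustration.

```python
def spearman_brown(reliability, length_factor):
    """Estimated reliability after changing test length by `length_factor`.

    A factor of 2 doubles the test; a factor of 0.25 keeps one quarter of it.
    Assumes the added or removed material is comparable to the rest.
    """
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (
        1.0 + (length_factor - 1.0) * reliability
    )


FULL_TEST_RELIABILITY = 0.90  # hypothetical value for a 60-minute test

for minutes in (60, 30, 15):
    factor = minutes / 60.0
    estimate = spearman_brown(FULL_TEST_RELIABILITY, factor)
    print(f"{minutes:>2} minutes: estimated reliability {estimate:.2f}")
```

Under these assumptions a quarter-length section falls to a reliability of roughly .69, which is adequate for group comparisons but hardly for directing the choices of an individual student.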


Intelligence Test Correlations with Academic Grades 


Mention was made above of the extensive data available re- 
garding tests of general intelligence (under various labels) as 
related to scholastic performance. Little purpose would be 
served by presenting here what would necessarily be but a small 
sampling of these voluminous data. To state that typical cor- 
relations with school or college averages run between .40 and 
.50 is a rough though fair generalization; considerably lower
or higher coefficients than usually found have at times been re- 
ported even for identical measures in different administrations. 
Differences in ability of the groups examined account for seem- 
ing vagaries among the many tests of this sort. The more widely 
they are used, the greater chance there is for atypical coeffi-
cients (analogous to “sports” in biological parlance) to develop.
Moreover, as will later be emphasized, a particular instrument
which serves quite well in the situation for which it has been
developed may prove almost useless elsewhere, even at ostensibly
the same educational level.

Lest this bald statement appear too sweeping, one factual 
illustration (out of many which might be given) is offered in its
support. Several years ago an Eastern university wished to 
develop some means for appraising the relative promise of 
scholarship applicants on a national basis and earlier in the 
year than College Board evidence, previously relied upon. for 
that purpose, would be available. It was therefore determined 
to hold a screening test in March, and make tentative awards 
on the basis of school record and test scores in the spring, so 
that candidates for financial aid could lay their plans with 
some assurance well in advance of taking the “College Boards. 
(This action, incidentally, soon led to establishment of the ob- 
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jective-type CEEB April Tests and in turn to the analogous 
series now held four times a year.) 

The Ohio State Psychological Test⁵ was selected for this
experiment and administered to all candidates for aid—about 
four times as many as the number of available scholarships. 
When the results were referred to previously established 
(State) freshman norms it was found, quite surprisingly, that 
the mean score of several hundred nation-wide scholarship ap- 
plicants ranked at percentile 98 on that scale. Hence the Ohio 
Examination (a carefully developed scholastic intelligence 
measure which has long served useful aims in its own locale) for 
this new purpose afforded no workable discrimination where it 
was most important—i.e., within the top half at least, of that 
particular group. The candidates finally chosen for financial 
aid ranked, on the average, at the ninety-ninth (State) test 
percentile. Lack of adequate spread in their test scores natu- 
rally made subsequent correlations thereof with freshman 
grades at the institution in question meaningless. The group’s 
self-selectivity to start with had virtually destroyed all effec- 
tiveness of an instrument well constructed to measure academic 
promise over a much wider range. 
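
The shrinkage of a correlation within a highly selected group can be illustrated by a small simulation. The sketch below is not drawn from the scholarship data described above. It assumes normally distributed test scores and grades with a correlation of .50 in the unselected population, and it keeps only those cases above the 98th percentile on the test; all names and figures in it are hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(rho=0.50, n=50_000, top_fraction=0.02, seed=1):
    """Return (r in the full group, r among the top scorers only)."""
    rng = random.Random(seed)
    tests, grades = [], []
    for _ in range(n):
        t = rng.gauss(0.0, 1.0)
        # Grades correlate rho with test scores by construction.
        g = rho * t + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        tests.append(t)
        grades.append(g)
    # Score that marks off the top fraction of the simulated population.
    cutoff = sorted(tests)[int(n * (1.0 - top_fraction))]
    selected = [(t, g) for t, g in zip(tests, grades) if t >= cutoff]
    sel_tests = [t for t, _ in selected]
    sel_grades = [g for _, g in selected]
    return pearson_r(tests, grades), pearson_r(sel_tests, sel_grades)


full_r, restricted_r = simulate()
print(f"full group r = {full_r:.2f}; selected group r = {restricted_r:.2f}")
```

Under these assumptions the coefficient in the selected group falls far below the population value of .50, although the test itself is unchanged.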

The foregoing example has been given to indicate why, as
earlier stated, little would be gained by citing a long list of
coefficients bearing upon the relationship of scholastic grades to
scores on a number of general intelligence tests. Instead, one
of these will be discussed at some length because numerous data
are available regarding both total and part scores thereon. This
instrument is the American Council Psychological Examination
for College Freshmen developed by L. L. and T. G. Thurstone
(1924 ff.) at the Council’s request. It was first made available
in 1924 and a companion, the analogous Examination for High
School students, in 1933.


THE AMERICAN COUNCIL PSYCHOLOGICAL 
EXAMINATION 


This instrument has been so widely utilized for years, at
both high school and collegiate levels, that specific discussion of
its component parts may seem unnecessary. It represents quite
well, however, the typical nature of modern general intelligence
tests and will be considered in subsequent chapters with particular
reference to the 1940 edition, which has been closely paralleled
in later forms. That consisted of six subtests, three
yielding a “Q” (Quantitative) and three others an “L” (Linguistic)
index. Combined, these give a so-called gross score.
The several parts are respectively designated as arithmetic,
figure analogies and number series, constituting the “Q” group;
completion, same-opposite and verbal analogies, comprising the
“L” series.⁶ In each booklet, sample orientation or warm-up
exercises are printed on separate pages, immediately preceding
the respective actual tests. Since this instrument is unrestricted
in circulation, and specimen copies may be obtained from the
American Council on Education, it is unnecessary to reproduce
illustrative materials here.

5. The Ohio State University Psychological Test, published by the Ohio
College Association, Columbus, Ohio (annually revised).

Information regarding the College Form, including number
of candidates tested, means and ranges for different institutions
(at first identified by name and later listed under code numbers
to preserve institutional anonymity) appeared annually
(1925-1939) in the Educational Record. However, since 1940 it has
been published each year as a separate Bulletin in the American
Council on Education Studies (1940 ff.). Standard deviations
of the score distributions are now reported, whereas earlier
indications of the test range were represented by Q1 (25th
percentile), Median and Q3 (75th percentile) for massed and
individual college populations. These data are extensive, but the
institutions from which they derive are so variant in nature and
widely different in educational standards that national norms
based upon such a motley congregation seem difficult to interpret
meaningfully for their respective separate groups.

With respect to the questions of reliability and validity, the
following statement is quoted from a comment on the 1931
Psychological Examination: “Previous studies of reliability
and validity have shown the reliability of the gross scores to
be about .95 and the correlation between test scores and
scholarship have averaged around .50 for a large number of colleges.
The schools using the tests are making many studies of the
value of them.” (Thurstone, 1932, p. 235.) Some idea of the
extent to which this examination has been annually employed
is given in the 1938 Report, which states in part: “Over 230,000
test blanks have been ordered by approximately 600 colleges.
. . . Three hundred twenty-three colleges have reported
scores of 68,899 students.” (Thurstone, 1938d, p. 209.) A later
bulletin (Thurstone, 1943) discusses scores reported for 49,020
students by 253 colleges, but these figures by no means include
the full number of examinations distributed in that year.
Among all the institutions from which basic norms for the
1937 edition were derived, only three out of thirty-three then
holding membership in the Association of American Universities
(highest nationally accredited educational status) were
represented. In 1942, the corresponding representation was
three out of thirty-four such institutions. Likewise, a small
proportion of those normally requiring College Board Examinations
for admission will be found on the list contributing to
ACPE norms. Anyone scanning it will note, by contrast, a generous
proportion of junior, teachers’ and sectarian colleges.
Beginning with 1940, the reporting colleges have been classified
into four groups according to academic status and separate
norms published for each category (Thurstone, 1943, pp. 6-10).

6. It is interesting to note that the original edition of the American Council
on Education Psychological Examination (Thurstone, 1925) was composed of
the following subtests: 1. Completion, 2. Arithmetical Reasoning, 3. Artificial
Language, 4. Proverbs, 5. Reading, 6. Opposites, 7. Grammar, 8. [illegible]
Reasoning. A profile form even then made provision for [illegible] by
subtests, grouped into two differential categories: linguistic and [illegible].
Norms for each subtest were based upon the records of approximately 6,[illegible]
students in twenty-five colleges.


ACPE NORMS 

Whether or not this Psychological Examination proves help- 
ful in differentiating the able from the average students at 
many institutions, it seems to offer little discriminating value 
for those which maintain more rigorous standards of selection. 
If half of their freshmen rank within the top 8 or 10 per cent 
of all college students so tested (and only a tenth below the 
top quarter) on published American Council norms, the latter 
serve little purpose for measurement and guidance needs at such 
institutions. The foregoing statement is not hypothetical; on 
trial of this examination with Yale freshmen, their average 
score was above the year’s composite 90th percentile. Moreover,
nearly four-fifths ranked above Q3 (75th percentile) on the 
published college norms (specifically 78% within the national 
top quarter range) on two separate administrations, to sample 
groups carefully chosen as scholastically representative of the 
entire class. 

Since it would be difficult to persuade freshman instructors
anywhere that “run of mine” entrants are intellectual supermen,
the soundest premise remaining is that ACPE national
norms seem inadequate, so far as the measurement needs of
many liberal arts colleges are concerned. Yet a bulletin issued
in 1936 on the history and activities of the American Council
states that its “Psychological Examination . . . represents
one of the most comprehensive experiments in testing in the
world.” (American Council on Education, 1936, p. 17.) If a
population which represents only about one-tenth of the most
outstanding educational institutions in this country, and none
elsewhere, can be accepted as globally comprehensive, that
somewhat complacent description might be justified.

In recent years, an attempt has been made to subdivide or
compare norms on this examination for different types of (more
or less) higher learning. Thus in the 1943 Report earlier cited,
separate distributions are given for: Type 1, four-year colleges;
Type 2, junior colleges; Type 3, teachers’ colleges; Type
4, technical and professional schools. Differences appear
throughout the several subgroups thus analyzed; yet it is evident
from Table 5A that wider contrasts exist within each of
these than are found between them. If some such procedure as is
represented by Flanagan’s (1939) development of Scaled
Scores or Toops’ (1939) suggestion for a “Standard Million”
(both mentioned in Chapter II) were applied to ACPE ratings,
value of the latter should be greatly enhanced for interpretive
purposes. Admittedly, the preceding statement reflects
a personal opinion; it cannot be substantiated by positive evidence
because conditions of administration which might provide
this have not been imposed.


Gross score medians for all 323 institutions as reported in
1938 (for the 1937 edition) ranged from 248 down to 92. For
the 1942 edition, gross score means by institutions ranged from
132 down to 39, thus showing that the scores were reported in
quite different scale units for these years. Table 5A has been
prepared to indicate some of the data which Thurstone presents
in his report on the 1942 college edition. The mean score made
by 49,020 students from all four institutional types was 102.37
and the standard deviation 24.82. These norms are based on
253 colleges. A glance at Thurstone’s Table I (1943, pp. 6-10),
presenting the rank order of individual colleges and universities
by mean score, indicates no marked relationship between
the type of college and the scores obtained. Table 5A
shows gross score means and standard deviations for each of
the four different types of institutions. The range of these
type means is surprisingly small: from 96.13 to 103.63.


TABLE 5A

Gross Score Data on 1942 Edition of American Council
Psychological Examination

                                Number of   Number of
Type of Institution             Colleges    Students      Mean      SD

1. Four-Year Colleges              158       39,283      103.63    24.77
2. Junior Colleges                  57        3,048       96.80    23.85
3. Teachers Colleges                29        4,600       96.13    24.52
4. Technical and Professional
   Schools                           9        2,089      100.47    24.68
All Institutions                   253       49,020      102.37    24.82

Data for this table were abstracted from Thurstone (1943, pp. 13-25).
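
The composite figures for all institutions can be checked against the four type rows by weighting each type mean by its number of students. The sketch below is a reader's check rather than part of Thurstone's report. It takes the type means as 103.63, 96.80, 96.13 and 100.47, with standard deviations of 24.77, 23.85, 24.52 and 24.68, and it treats each reported standard deviation as a population value for its group, which is an assumption.

```python
import math

# (type of institution, number of students, mean, standard deviation)
TABLE_5A = [
    ("Four-year colleges", 39_283, 103.63, 24.77),
    ("Junior colleges", 3_048, 96.80, 23.85),
    ("Teachers colleges", 4_600, 96.13, 24.52),
    ("Technical and professional schools", 2_089, 100.47, 24.68),
]


def pooled_mean_and_sd(rows):
    """Combine group sizes, means and SDs into totals for the whole population."""
    total_n = sum(n for _, n, _, _ in rows)
    grand_mean = sum(n * mean for _, n, mean, _ in rows) / total_n
    # Total variance = within-group variance + variance of the group means.
    total_ss = sum(n * (sd ** 2 + (mean - grand_mean) ** 2) for _, n, mean, sd in rows)
    return total_n, grand_mean, math.sqrt(total_ss / total_n)


total_n, grand_mean, grand_sd = pooled_mean_and_sd(TABLE_5A)
print(total_n, round(grand_mean, 2), round(grand_sd, 2))
```

On these figures the student counts sum to 49,020, the weighted mean comes to 102.37, and the pooled standard deviation falls within a few hundredths of the reported 24.82.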


Gross score means of Type 1 (four-year) colleges range
from 132.41 and a corresponding percentile rank within the
total 1942 reported population of 89, to 38.86 and a percentile
rank of 1 (Thurstone, 1943, pp. 6-10). It is therefore apparent
that Thurstone’s Type 1 institutions represent a heterogeneous
collection indeed, covering approximately 90% of the
total range of means for all 253 colleges.

Though included among Thurstone’s own data (and classi- 
fied as Type 1), the lowest ranking college should probably, 
in this discussion and all fairness, be disregarded as a “sport”; 
its mean is little more than half of that for the next lowest in- 
stitution. Even if culprit No. 253 is eliminated, however, the 
range of Type 1 college means (on this examination) extends 
from percentile 89 to 14. In other words, after four-year col- 
leges have been segregated, the variation in average ACPE 
scores has been reduced by only about one-fourth of that for 
all 253 colleges reporting scores in 1942.


VALIDITY OF THE ACPE 

Validations of this test, and data as to discriminative power
of its respective “Q” and “L” sections, are not available in a
comprehensive sense. It has already been pointed out that
ACPE norms are based upon an indiscriminate educational
sampling, not thoroughly representative and with little effective
control or definition. Hence for this instrument (as for
many others of analogous nature) the correlation between test
scores and various achievement criteria may be expected to
differ widely from one institution to another, depending upon
the range of ability among such variant student groups. The
few correlations cited later in Part II (with reference to “L”
and “Q” scores respectively) are not high. Unfortunately,
adequate data as to level and range of ability represented are
lacking in most reports of this test’s operational effectiveness.

One early comment on the predictive value of the 1925 examination
gives correlations between total test scores and freshman
scholarship for 26 institutions, with an average coefficient of
.45 for all colleges reporting (Thurstone, 1927, p. 165). Most
of the validation data available come from independent sources.
However, one important report on the relationship of ACPE
scores to academic standing in a well-defined situation was
published in 1939, with respect to earlier forms of this examination.
The data refer to “validity of the Examination for scholarship
in the College at the University of Chicago for several
years.” To quote further: “Since all students in the College are
required to take the four general courses in the freshman and
sophomore curriculum, and since the grades in these courses are
determined entirely by the student performance on four six-hour
examinations, these validity correlations are of some interest.
In the table we have included also the intercorrelations of
the four general course examinations. As a measure of general
scholarship for these groups, we have taken the average of the
four general course examinations. It will be seen that the
correlations of the Psychological Examination and average
scholarship for freshmen and sophomores are approximately .50.⁷
The correlations for 1933-34 are based on from 200 to 2,000
cases. For the other academic years, the correlations are based
on from 400 to 600 cases.” (Thurstone, 1939, pp. 294-295.)


INTERCORRELATIONS AMONG COLLEGE GRADES


Table XVIII of the article cited is reproduced herewith in
full (Table 5B). No data as to means or standard deviations
were given. One of the most interesting aspects of this table
is the relatively high intercorrelations (average .75) among
grades in the four general courses. That result has a definite
bearing upon the topic of general (scholastic) intelligence
testing. Within the writer’s observation throughout a considerable
period, such relationships between each of several more or
less disparate educational fields are seldom found at college
freshman-sophomore levels. It seems probable, since all student
grades in the college are determined by performance on standardized
examinations (largely of the objective type), that high
reliability of such measures, and in addition possibly some
generalized test-taking ability, exercise a predominant influence
in this situation.⁸ Correlations among subjectively determined
freshman grades (at about the same period of time, and at
Yale rather than at the University of Chicago) bear out this
opinion. The Yale intercorrelations reported in Table 6 are
decidedly lower than the corresponding University of Chicago
coefficients shown in Table 5B.


TABLE 5B

Correlations of the Psychological Examination with
Scholarship *

                        Biol.    Human-   Phys.    Social
                        Sci.     ities    Sci.     Sci.     Average   Year

Psychological           .48      .49      .46      .46      .52       1933-34
Examination             .46      .46      .39      .48      .50       1934-35
                        .47      .47      .46      .50      --        1935-36
                        .47      .58      .40      .51      .53       1936-37

Biological Sciences              .75      .80      .78      --        1933-34
Introductory Course              .82      .72      .84      --        1934-35
                                 --       --       --       --        1935-36
                                 .75      .81      .82      --        1936-37

Humanities                                .60      .?5      --        1933-34
Introductory Course                       .66      .82      --        1934-35
                                          --       --       --        1935-36
                                          .78      .72      --        1936-37

Physical Sciences                                  .80      --        1933-34
Introductory Course                                .67      --        1934-35
                                                   --       --        1935-36
                                                   .66      --        1936-37

* Reprinted from Thurstone (1939, p. 294).


7. The separate coefficients represented by this generalization appear in
Table 5B.

8. The methods of examination, and of promotion on the basis of demonstrated
achievement, so carefully developed under the guidance of L. L. Thurstone
and later Ralph W. Tyler (1942) as University Examiners with their respective
associates merit highest acclaim. They represent a system of scholastic
appraisal, within certain limitations, which has not elsewhere been so thoroughly
achieved. Yet this system, as implied above, may tend to obscure pertinent
individual differences. The topic of “test-taking ability,” with particular
reference to multiple-choice responses on separate answer sheets, herein is discussed
more fully (Chapter VI, pp. 209-210). A recent personal communication from
Dr. B. S. Bloom, Examiner at The University of Chicago, describes the high
reliability coefficients found in comprehensive examinations: “For a number of
years we computed the coefficient of reliability on each of the comprehensive
examinations, but gave up the practice recently. Our reliability as computed
by split-half formula and by the Kuder-Richardson formula where appropriate,
ran in the neighborhood of .95. Many were as high as .98 and .99.”


Additional pertinent data have been secured for a group of
Navy students who entered Yale in March, 1944. At the completion
of their first term, special Navy achievement tests were
administered as more fully described in Chapter V. Thus, in
English, history, mathematics, and physics, both grades in
course and objective type achievement test scores were available
for the same 224 students, all of whom pursued a uniform
curriculum. Intercorrelations among grades ranged from .40 to
.57, the median coefficient being .44. Among achievement test
scores, corresponding coefficients ranged from .36 to .61, the
median being .43. Intercorrelations among instructors’ grades
were strikingly similar to those reported for civilian students a
decade earlier (Table 6 over). One question which might well
be raised is why the intercorrelations (.60 to .84) among examination
scores at Chicago (Table 5B) were so much higher
than in the Navy situation at Yale, also employing objective
tests. The data pose the question but do not supply the answer.


OTHER COMMENTS ON THE ACPE


Buros’ Yearbook contains two reviews of the American Council
Psychological Examination for College Freshmen. Both
reviewers (Jack W. Dunlap of the University of Rochester and
Robert L. Thorndike of Columbia University) remark on the
allowance of 19 minutes for practice exercises and only 33 for
actual testing time. They further comment in part as follows:

Dunlap: “Some data are presented as to the validity of the
1938 forms, where it is shown that the examination correlates
approximately .50 with the results of each of four six-hour
examinations in introductory courses in biology, the humanities,
physical sciences and the social sciences.” (Buros, 1941, p. 200.)

Thorndike: “These tests represent, then, the continuation of
a well-planned hour test for college students with certain new
variations, the value of which remains to be determined by
further research.” (Idem, 1941, p. 201.)

All the foregoing data refer to earlier forms of the ACPE
than the 1940 edition (subsequently reviewed in Part II) and
therefore to total rather than disparate “Q” and “L” ratings.
Since the major purpose of our study as a whole is to evaluate
methods for differential prognosis, further discussions of the
American Council Psychological Examination will consider its
separate parts in chapters respectively dealing with verbal and
mathematical aptitudes.

Total score on this test is still regarded by its authors, and 
by many persons using it with high school or college freshman 
students, as а good all-round measure of academic promise. 
Utility of the American Council Examination in that sense is
not challenged, despite the criticisms directed against certain 
technical aspects of its standardization—with special emphasis 
upon national norms. It is doubtless one of the best modern, 
general intelligence tests; like any other instrument, however, 
it should be properly calibrated to meet the demands of a given 
situation or task. Our criticisms in fact pertain less to com- 
position of the test per se than to inadequacy, thus far, of its 
evaluation in meaningful terms.

Previous remarks as to the questionable significance, for 
many institutions, of ACPE percentiles as published do not at 
all deny that the examination may be used to their advantage. 
First, however, norms pertinent to the respective educational 
standards of each college should be established de novo and 
critical points (e.g. for admission, scholarship awards or ad- 
vanced placement) also determined locally. The same observa- 
tions apply to other tests of like general nature and probably of 
about the same difficulty, such as those developed in statewide 
measurement and guidance programs. These are, presumably, 
appropriate to the objectives and abilities of students examined;
they may or may not be so for other groups in different
localities, nor are they expected to be thus universally ap- 
plicable without restandardization. The very fact that numer- 
ous state universities have constructed their own general intel- 
ligence or scholastic aptitude tests indicates a recognized need 
for special, localized evaluations. 


NEED FOR EQUIVALENT-SCORE DATA 
Interpretation of individual results on any such program
outside the locale wherein it has been specifically developed is
surrounded with uncertainties. The Director of Admissions at
Western Reserve University or at Antioch College, for example,
has doubtless stored up considerable experience with the Ohio
State Psychological Examination and knows rather well what
a given score thereon means as to probabilities of academic
success at his institution. But can either of them assume that
various scores, or even percentile ranks, reported for analogous tests
held in other states, or even for the ACPE (unless these measures
have been adequately studied throughout several preceding
classes on his own campus), have equivalent significance?
Conversely, how shall the University of Wisconsin or of Michigan
interpret results from the Ohio, Illinois or Iowa testing
programs? In some instances, percentiles or standard scores
may be virtually interchangeable; in others they are not. Moreover,
an institution which draws students from every section of
the country may receive a wide range of test reports expressed
in different ways: e.g., as raw scores on varying scales, as
intelligence quotients of one sort or another or as percentiles. A
recent instance illustrative of this situation is provided in an
announcement by the University of Connecticut. The following is
quoted directly from a bulletin announcing the Connecticut
Cooperative Testing Program:

“Scores of two types will be used in reporting results: (1) 
Raw scores—scores obtained directly from the scoring of the 
test or instrument; and (2) Derived scores— scores obtained by 
transmutation of the raw scores to scores of some other type, 
the particular type of derived scores varying from test to test 
and being dependent upon the type of norms used by each test 
publisher. Scores reported to the participating schools will fol- 
low the publisher’s policy for each test or instrument, i.e., raw 
scores when the publisher presents his norms subject to direct 
interpretation from raw scores, and the appropriate type of 
derived scores when the publisher presents his norms subject to
direct interpretation only from derived scores.” (University of 
Connecticut, 1945, p. 27.) 

Unless a unified program is developed, the institution dis- 
tributing tests can do little else than to follow the pattern re- 
ported in the University of Connecticut announcement. Thus 
the confusion already inherent in evaluating secondary-school 
records often seems but worse confounded by multiplicity of 
the very measures—objective achievement or scholastic intelli- 
gence tests—which were originally intended to resolve that 
confusion. A common denominator among such instruments is 
badly needed. The writer some years ago suggested to the American
Council on Education that it sponsor a broad sampling
experiment, from which a table might be prepared indicating 
approximate equivalents on a number of the scales now vari- 
ously employed in general academic prognosis. One could then 
—roughly to be sure, but more readily than is now feasible— 
transmute individual scores, as severally reported by different 
testing agencies, to reasonably meaningful basic terms. 
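
One way such a table of approximate equivalents might be built is by matching percentile ranks, provided both tests have been taken by the same broad sample. The sketch below is an illustration of that idea and not a procedure described in this chapter. The function names and the two short score lists are hypothetical, and a practical table would require far larger samples and some smoothing of the distributions.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(sorted_scores, x):
    """Midpoint percentile rank of score x within an ascending list of scores."""
    below = bisect_left(sorted_scores, x)
    equal = bisect_right(sorted_scores, x) - below
    return 100.0 * (below + 0.5 * equal) / len(sorted_scores)


def score_at_percentile(sorted_scores, rank):
    """Score at a given percentile rank, interpolating between neighbors."""
    n = len(sorted_scores)
    position = rank / 100.0 * n - 0.5  # inverse of the midpoint rank above
    position = min(max(position, 0.0), n - 1.0)
    lower = int(position)
    upper = min(lower + 1, n - 1)
    fraction = position - lower
    return sorted_scores[lower] + fraction * (sorted_scores[upper] - sorted_scores[lower])


def equivalent_score(x, scores_a, scores_b):
    """Score on scale B holding the same percentile rank as x holds on scale A."""
    rank = percentile_rank(sorted(scores_a), x)
    return score_at_percentile(sorted(scores_b), rank)


# Hypothetical scores earned by the same five students on two tests.
scale_a = [10, 20, 30, 40, 50]
scale_b = [100, 200, 300, 400, 500]

for score in (20, 30, 35):
    print(score, "on scale A is roughly", equivalent_score(score, scale_a, scale_b), "on scale B")
```

The equivalents so obtained hold only for groups resembling the common sample, which is precisely why a broad sampling experiment would be needed.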


THE COLLEGE BOARD SCHOLASTIC APTITUDE TEST 


Several references will subsequently be made to the Scholastic
Aptitude Test developed by the late Professor Carl C. Brigham
(1932) of Princeton University and his associates (1933 ff.,
Annual Reports of the Secretary, College Entrance Examination
Board) in regard to both verbal and mathematical indices.
A special commission, with Brigham as Chairman, was appointed
by the College Entrance Examination Board in 1925
“to direct the preparation and scoring of the psychological
examinations to be held in June and September.”⁹ The commission’s
first report appeared the next year and contains several
interesting comments on the subject of general intelligence and
scholastic aptitude measures. Although not now, or for some
years past, a general intelligence measure, the Scholastic
Aptitude Test is mentioned at this point for the purpose of
tracing its evolution and also for brief comparison with the
American Council Psychological Examination. Both instruments
started their testing careers, so to speak, as nondifferential
indices of academic promise. Both have, at different times, been
revised to yield separate verbal or linguistic and mathematical
or quantitative scores. That is, the “V” and “M” notations
employed by the College Board to identify respective parts of
its Scholastic Aptitude Test correspond roughly to Thurstone’s
“L” and “Q” designations earlier noted in connection with the
American Council Psychological Examination.

These two broad areas were first segregated in modern attempts
to replace omnibus tests with others of greater factorial
purity and enhanced directional import. Overlapping exists
between them, for the very good reason that each is basically
important in any well-rounded education. More specialized
aptitude measures, such as will be considered in chapters to follow,
represent an even later development of individual-guidance
measures for particular needs.


9. Cf. the discussion of pretesting methods on pp. 217 ff., Chapter VII.


A résumé of progressive stages through which the College
Entrance Examination Board’s Scholastic Aptitude Test has
passed may serve to illustrate the trend in recent years from general
to specific aspects of testing. Originally, the SAT was composed
of nine subtests (plus an experimental section employed
for the trial of new items or ideas) as follows (CEEB Annual
Report, 1926):

1. Definitions
2. Arithmetical Problems
3. Classification
4. Artificial Language
5. Antonyms
6. Number Series Completion
7. Analogies
8. Logical Inference
9. Paragraph Reading
10. Experimental Materials

A profile based on the first nine SAT subtests (if arranged
in a manner analogous to the profile originally provided by
Thurstone for the ACPE) would yield corresponding “L” and
“Q” indices. Evolution of this instrument over a considerable
period has accompanied notable developments by the College
Entrance Examination Board in measurement techniques, as
discussed in Chapter VII. These are partially described, with
respect to item analysis, the scaling of questions in order of
difficulty, discriminative power, etc., in Brigham’s (1932) progress
report somewhat quizzically entitled A Study of Error.

It will be noted that the initial SAT, containing as it did 
several diverse parts, resembled the then usual omnibus type 
of measure, with three important exceptions. First, each sub- 
test was separately scored, even though reports were usually 
made on a total-score basis. Second, within the extensive College 
Board schedule, it could be allotted a time limit of two to three 
hours which permitted the use of subtests long enough to be 
individually quite reliable. Third, it was decided from the start 
to convert original or raw into sigma scores on a scale whose 
mean was then set at 500 with a standard deviation of 100. Ad- 
vantages of this system, which has since become widely used, are 
summarized in Chapter II. Mathematical derivation of the total 
score thus obtained from the various subtest scores is discussed 
in the College Entrance Examination Board’s (1926) Report 
just cited. Progressive development of the Scholastic Aptitude 
Test is outlined roughly as follows:

Years      Composition                      Measurement Objectives

1926-27    9 subtests of mixed nature       General scholastic promise
           as listed above

1928-29    7 verbal subtests                Verbal background and
                                            comprehension

1930-35    8 verbal subtests                Separate indices of:
           1 mathematical section           A. Verbal background
           (100 items in composite             and comprehension
           series of ascending              B. Mathematical background
           difficulty)                         and ingenuity

1936       5 verbal subtests                Purely verbal (mathematical
                                            section supplanted by a new
                                            and separate Mathematical
                                            Attainment Test)

1937-41    April Form: 3 verbal             Approximately the same as
           subtests, 1 mathematical         for 1930-35 period
           section (100 composite items)
           June Form: analogous to          Same as for 1936
           that for 1936

1942-46    April, June, September and       Same as for 1937-41 April
           December or January forms,       program, yielding separate
           analogous to preceding           verbal and mathematical
           April tests                      “aptitude” indices.

The developments noted above have produced (a) in every year
beginning with 1928 a pure verbal factor test, which has been
brought to a high state of technical perfection, and (b) in some
years, as currently, a mathematical aptitude measure also of
considerable effectiveness, though probably less pure and stable
than the foregoing one.

From 1937 to 1941 there were two SAT’s even within each year:
a spring version containing both verbal and mathematical
elements (segregated) and a June edition which, like its
immediate predecessors, was entirely verbal in content. Following
widespread adoption of war-accelerated academic programs
in 1942, the traditional (long-form) June College Board
Examinations were suspended. Hence the present Scholastic Aptitude
Test is essentially that first designed to aid in the selection
of financial aid recipients (i.e., the April form, with two scores,
verbal and mathematical, reported). The writer feels that
reinstatement of these separately, as required prognostic tests of
all candidates for admission to many colleges, is fortunate from
a student-guidance standpoint. Experience with the earlier and
present measures alike has demonstrated conclusively their
value in differential counseling. These details may seem unduly
elaborated, but they are nevertheless essential to full realization
of the successive metamorphoses through which this Scholastic
Aptitude Test has passed.
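
The conversion of raw into sigma scores on a scale with mean 500 and standard deviation 100, mentioned earlier in this account, can be illustrated by a minimal sketch. The raw scores below are hypothetical, and the use of the population standard deviation is an assumption; the source does not state which form was employed.

```python
from statistics import mean, pstdev


def scaled_scores(raw_scores, target_mean=500.0, target_sd=100.0):
    """Convert raw scores to sigma (z) scores, then rescale.

    Each raw score x becomes target_mean + target_sd * (x - m) / s,
    where m and s are the mean and standard deviation of the group.
    """
    m = mean(raw_scores)
    s = pstdev(raw_scores)  # assumption: population SD of the tested group
    if s == 0:
        raise ValueError("all raw scores are identical; sigma scores undefined")
    return [target_mean + target_sd * (x - m) / s for x in raw_scores]


# Hypothetical raw scores for five candidates (illustration only).
raw = [42, 55, 61, 48, 70]
scaled = scaled_scores(raw)
```

Whatever the raw-score range of a given form, the rescaled group has mean 500 and standard deviation 100, which is what makes scores from different forms roughly comparable.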


SUBSEQUENT CONSIDERATION OF DIFFERENTIAL 
MEASURES 


Achievement, aptitude and general intelligence tests have 
thus far been considered but sketchily, in terms of basic prin- 
ciples and objectives. Chapter I was intended to indicate how 
these various types of mental measurement vary in some re- 
spects and overlap in others. Following a parenthetical outline 
of statistical terms, methods and problems, attention has been 
directed in this chapter to so-called IQ or general intelligence 
testing procedures. 

These introductory remarks, while in one sense not particu- 
larly germane to our main topic of differential aptitude testing 
in various educational fields, nevertheless serve as background 
media, providing a means of orientation for the more specialized 
materials to follow. The following chapters, though continuing
this introductory stage, will deal more specifically with differ- 
ential measures. As already explained, the consideration of 
particular aptitudes for various curricular fields of concentra- 
tion at college levels will in turn constitute Part II. 

Most plans of arrangement have the fault of backtracking at 
times, which seems in this case unavoidable. The measurement 
techniques in question, like the mental abilities they seek to 
evaluate, present a tangled skein which not even the most eso- 
teric statistical procedures can fully unravel. Topics such as 
evolution of the ACPE or SAT from omnibus to differential 
tests, for example, introduced at one point may initially emphasize
forward reference; later, in relationship to some earlier
setting or objective, a retroactive outlook develops which cannot 
well be ignored. The progressive emergence of various ideas or 
methods most pertinent for discussion in certain fields of study 
has never followed any logical sequence. Many “hunches” have 
affected the evolution of mental tests, and no timetable thereof 
can well be established in the dim light of present knowledge. 


CHAPTER IV 


ACHIEVEMENT TESTING 


NUMEROUS critics have deplored the wide extension
of so-called “merely factual” examination technique,
asserting that it tends to freeze educational curricula
and pedagogical methods alike into undesirable patterns and
to place undue emphasis upon rote memory. Analogous
criticisms have long been made, though for different reasons,
against other methods, suspect as machinery to standardize
scholastic appraisal. This sort of criticism of College Board,
Regents or certain professional and Civil Service examinations
is primarily directed toward prescription of the subject matter
per se. Quite apart from that special problem, the one
mentioned above deserves consideration.

Some critics of objective tests maintain that no such devices
can adequately measure ability to think, to organize ideas or to
express oneself clearly; in short, that such instruments demand
little or nothing beyond mere recall of isolated and perhaps
unimportant facts. At one time, comments of this nature were
undoubtedly pertinent with regard to the earlier short-answer
test forms, particularly where reliance was placed upon mere
true-false questions. These are inherently weak because it is
difficult in most educational areas to concoct items of a
searching (i.e., more than superficial) nature which can assuredly be
answered as either “yes” or “no.” Moreover, shrewd guessing
can sometimes outwit statistical formulas designed to
circumvent it. Careful test constructors have since fully realized the
shortcomings of earlier procedure. Multiple-choice, matching,
logical inference and other ingenious short-answer devices
evolved in recent years show a definite trend toward the
measurement of higher and more complex thought processes than
could well be appraised by simply recalling isolated bits of
information.

1. The hurly-burly in educational journals and the public press as to the
validity and fairness of the New York Times History Survey well exemplifies
this never-ending type of controversy. Since publication of the initial results
and description of the test employed (Fine, 1943), many articles and
communications have appeared pro and con both the survey itself and the test
constructed to measure students’ knowledge of American history and
geography. These are too numerous for individual citation.

It is regrettable that some petty questions do still appear
on objective tests. Those assembled by a single department or
individual professor are often the worst offenders in this respect.
Limited and local construction seldom has the benefit of
pretesting;² also personal idiosyncrasies of the examiner may
introduce freak items. However, that characteristic is by no means
limited to “new-style” (i.e., multiple-choice or other restricted-
answer) examinations. Specific instances could be cited, were 
that necessary, of term papers and finals which likewise test the 
student's recall of petty details as a check on what he has read. 
One is at a loss to understand how minute lore of this sort (in 
whatever form) is related to appreciation of a great literary 
work or of broad historical and socio-economic concepts. Like 
the glossary to Shakespeare, such information provides an easy 
medium for the instructor's use in constructing an examination 
with limited objectives. If corresponding items appeared in 
multiple-choice or matching form on any objective-type test, 
they would probably be denounced as trivial, unfair and ex- 
cessively factual by the very man who gloats over the catchy 
questions devised for his own use. 


Completion versus “Place-Your-Bet” Answers


Considerable difference nevertheless exists between the nature 
of response elicited by even those molecular completion ques- 
tions discussed above and that requiring merely a choice among 
several options. The former demands unassisted recall by the
student of definite previous assignments, and perhaps of col- 
lateral reading as well, plus some original exposition (however 
brief) on his part. The latter (multiple-choice form) presents 
an easier process of picking one out of four or five possible
winning answers. Messieurs, faites vos jeux!

“Multiple-choice,” so often attributed to the currently most 
popular form of objective-type items, is a misnomer. “Re- 
stricted answer” gives a better and more realistic description, 
since the choice afforded is multiple only to a degree restricted 
by the few offerings provided. These, moreover, may be preg- 
nant with either negative or positive clues. Hence the right 
answer can sometimes be reached by the elimination of two ob- 
vious impossibilities and a shrewd guess as to which of the others 
is most likely. Theoretical corrections have been developed to
penalize such guessing by deduction for wrong answers (e.g.,
total score computed by “Rights minus ¼ Wrongs” or some
analogous formula, the negative fraction depending upon number
of options per question). These are, however, dubious in
practical value because negative clues and a little knowledge of
the subject may enable a student to place his bets successfully
on many questions which he could not answer so well in his own
words.

2. Cf. Chapter VII (pp. 237-238).
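
The correction for guessing described above can be sketched as follows. The general form, rights minus wrongs divided by one less than the number of options, is the conventional formula and is assumed here to be what the text intends; it reduces to the quoted one-quarter deduction for five-option items. The function name and the figures are hypothetical.

```python
def corrected_score(rights, wrongs, options_per_item):
    """Conventional correction for guessing: R - W / (k - 1).

    Omitted items count neither as rights nor as wrongs. With k = 5
    options the deduction is one quarter of the wrong answers.
    """
    if options_per_item < 2:
        raise ValueError("an item needs at least two options")
    return rights - wrongs / (options_per_item - 1)


# Hypothetical candidate: 60 right, 20 wrong on five-option items.
five_option = corrected_score(60, 20, 5)

# True-false items (k = 2) reduce to "Rights minus Wrongs".
true_false = corrected_score(40, 10, 2)
```

The deduction only cancels blind guessing on average. It does nothing about the partly informed guess made after eliminating implausible options, which is the weakness the text points out.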

For example, except in tests specifically designed to measure 
such elements as spelling, English usage or familiarity with 
foreign languages, all items should be well expressed without 
ambiguity; and of course with proper names, geographical
localities, etc., correctly spelled. That characteristic alone in re-
stricted-answer questions is advantageous to the respondent. 
His own grammar, clarity of expression and even spelling are 
not tested, as they would be in writing even a brief factual 
answer not elaborated as a theme or essay on the same topic. 
Pullias illustrates the point at issue as follows: 

“In the main, the verbal responses required by new type tests 
are of two kinds, namely, the recall type and the recognition 
type. In the first the learner must recall information which the 
tester believes to be indicative of learning. For example:
America was discovered by ______. The pupil must be able to supply
the correct response. In the case of the second type the learner
must recognize the response that indicates learning. As an
illustration: America was discovered by (a) Julius Caesar; (b) Sir
Walter Raleigh; (c) Columbus; (d) Amerigo Vespucci.”
(Pullias, 1938, p. 75.)

Pullias was concerned with the proportion of answers which,
though partially incorrect (e.g., misspelled), were nevertheless
accepted in varying degrees by teachers as indicating
knowledge of the question. Considerable “variation in evaluation of
responses” was found; certain errors were evidently regarded
as much less serious than others. No doubt they are, but in
subjective grading it would hardly seem fair to fail a well-written
essay just because one or two words are incorrectly spelled.
However, the main issue here is a strong probability that most,
if not all among these variously “more or less wrong” answers
would have been strictly correct, had the pupils been able merely
to choose among well-worded, properly spelled alternatives
rather than to complete the question for themselves.


RESTRICTED-ANSWER EXAMINATIONS


In the earlier discussion of sampling methods,³ brief mention
was made of restricted-answer versus long-form (essay) exam- 
inations. It was suggested that one operated through an exten- 
sive but perhaps superficial sampling of familiarity with a great 
number of items; the other, through probing more deeply, 
though on necessarily fewer topics, into the student's knowledge
of a comprehensive nature. Thus each type of examination has 
its merits and limitations. Before proceeding to discuss specific 
achievement testing batteries, further consideration of this in- 
triguing question may be in order. 

Such aspersions upon restricted-answer tests as were men- 
tioned above perhaps give a certain retributive satisfaction to 
supporters of the “old-school” essay examinations, long under 
fire (as “uncertain, noncomparable, highly subjective” instru- 
ments) from persons whose own new-type batteries are now 
being shelled by a counterbarrage. It seems to have been curi- 
ously overlooked that veritable prototypes of the modern ob- 
jective, wide-sampling test methods are exemplified by what, 
long ago, used to be called spot questions or passages. In the 
writer’s undergraduate years (which considerably antedate the 
modern testing movement), instructors who later regarded this 
with scorn had themselves long employed spot questions as a
sort of objective-testing technique. No doubt that formidable 
phraseology, if applied to their own efforts, would have given 
them academic shudders, not without justification. They were 
nevertheless striving as amateurs for certain aims which were 
soon to be pre-empted by measurement professionals, with con- 
sequent recrimination. 

Perhaps the chief difference between earlier, personal at- 
tempts to introduce objectivity in sampling and grading 
through spot passages etc., and later development of more 
scientifically controlled testing procedures, amounts to no more 
than a guerrilla conflict between subject-matter experts who 
have a limited grasp of testing methods and testing experts who 
have a limited grasp of the subject matter. Obviously they 
should cease sniping at each other and collaborate instead. The 
Cooperative Test Service, the College Entrance Examination 
Board, the Carnegie Foundation’s Graduate Record Examina- 
tion and many local undertakings at various institutions have 
achieved notable cooperation along these lines. 


3. Cf. Chapter II, “Statistical Principles” (pp. 26-28).

It is interesting to note that certain items of the sort which
might well be decried in a short-answer test as unduly narrow 
or factual nevertheless appear in blessed sanctity as direct de- 
scendants of the old spot passage, even on current long-form ex- 
aminations. While these are ostensibly of the essay type, some 
questions which seem to demand mere factual knowledge or 
memory for details may be introduced for sampling purposes. 


Standardization versus Comparability 


The problem of fairly and comprehensively measuring
educational achievement by dependable means, and in respect to other
than unduly limited (i.e., local and therefore perhaps nationally
inadequate) criteria, is complicated and presents an obvious
dilemma. If the method adopted is individual and specialized,
then it may also be insular and esoteric; if broader in scope, it
may be in turn denounced as encouraging standardization. The
latter indictment, wrongly or not, has often been brought
against objective-type measures and by some critics
particularly against the Cooperative Tests shortly to be discussed.

Thus far, at least, they seem to have endured the buffeting
rather well. A greater danger to which they are subjected lies
not in weakness of the technique they represent, but in
exaggerated, naive reliance by some teachers or counselors upon their
results as the “be all and end all” of individual measurement.
That is certainly not the view of their leading proponents.
Wood,⁴ for example, has written: “The objective examination,
as already indicated, supplements the essay examination quite
effectively, supplying its worse deficiencies most happily.” That
role of serving as a complement to, rather than a substitute for,
other means of educational measurement and guidance is
stressed throughout the papers cited below and many others.
The foregoing comments may appear digressive, but they are
of some interest because numerous criticisms and subsequent
rebuttals are to be found in educational literature concerning
old-form or essay-type examinations, and more recently of
objective tests. In the writer's opinion, each can do some things
better than the other and some things not so well. Effective
measurement of a student's educational achievement (whether
in school, college or subsequently) calls for both objective and
subjective methods of appraisal, in due combination.

4. Wood, Ben D., “The Structure and Content of the Comprehensive
Examination [illegible] Sophomores,” Recent [illegible], Chapter XX,
p. 198 (Proceedings of [illegible], Vol. III, 1931, University of Chicago
Press, Chicago, Illinois). Wood, Ben D., “Basic Considerations in
Educational Testing,” [illegible] on Educational Testing, Educational Records
Bureau, New York, May, 1933; and “The Major Strategy of Guidance,”
Educational Record, Vol. XV, No. 4, October, 1934, pp. 419-444. McConn, Max,
“Examinations Old and New: Their Uses and Abuses,” Educational Record,
Vol. XVI, No. 4, October, 1935, pp. 375-411.

In this respect it is worthy of note that the New York State 
Board of Law Examiners has for some years utilized, with 
marked success, both long-form (essay) and short-form (true- 
false) questions in examining candidates for admission to the 
Bar. As will shortly be indicated, academic records largely rep- 
resenting subjective appraisals of achievement, when combined 
with objective test scores on the Carnegie Foundation’s Grad- 
uate Record Examination, yield a better index of promise for 
advanced (especially Doctoral) studies than does either of these 
grading methods alone. Hence it is well to remember that ob- 
jective tests of the type discussed in this chapter are by no 
means the only dependable measures of academic achievement. 
Yet they offer great advantages in relation to group or individ- 
ual comparisons, where merely local marks may have little or no 
interchangeability from one institution to another. It was in
fact a definite need and determined search for real comparabil- 
ity which raised the eventual bugaboo of standardization. 


Advantages and Disadvantages of the 
Restricted-Answer Form 


The writer believes that neither sort of measurement has all 
the shortcomings variously attributed to it; also that neither
can fulfill completely the expectation of its respective sponsors. 
To repeat a conviction earlier expressed, the validity of scholas- 
tic evaluation under most circumstances reaches its highest 
practical effectiveness when both types of appraisal are em- 
ployed in due combination. The two methods, teamed in joint 
effort, will certainly accomplish more for the cause of education 
than either can alone. Unity of command is essential to success.

The advantage of objectivity in scoring what are still called
new-type tests (although they have in fact enjoyed wide
circulation for many years) is of course obtained at the expense of
restriction in answers (e.g., true-false, multiple-choice, match- 
ing, etc.). The strictures directed against them imply that an 
equal degree of restriction is placed upon the mental processes 
by which one’s answer is selected; i.e., that mere “press-the-
button, you know it or you don’t” response necessarily is all that
items of this type can evoke. 

Today, such a contention probably reflects lack of informa- 
tion. To be sure, silly, undiscriminating or otherwise poorly 
constructed restricted-answer items can be found, even in tests 
on which considerable care has been exercised. Their
proportionate incidence among demonstrably valid items in a well-
organized battery is, however (we are willing to bet), less than 
that of equally futile questions asked on many essay examina- 
tions and ten-minute papers. One basis for this contention is 
that short-answer questions are readily pretested—i.e., their 
functional efficiency determined through item analyses—so that 
measures can be devised with known characteristics, within 
reasonable limits of tolerance. It is only in recent years that like 
pretesting of essay questions has been undertaken with marked 
success, notably by the College Entrance Examination Board. 
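
The pretesting by item analysis mentioned above can be illustrated with a minimal sketch. The text does not say which indices were computed; the two used here (proportion answering correctly, and the difference in that proportion between high-scoring and low-scoring groups) are a common choice and are an assumption. The 27 percent group size is likewise a conventional default, and the response matrix is hypothetical.

```python
def item_analysis(responses, group_fraction=0.27):
    """Return (difficulty, discrimination) for each item.

    responses: one row per examinee, one 0/1 entry per item.
    difficulty: proportion of all examinees answering correctly.
    discrimination: proportion correct in the top-scoring group minus
    proportion correct in the bottom-scoring group.
    """
    if not responses or not responses[0]:
        raise ValueError("need at least one examinee and one item")
    n = len(responses)
    totals = [sum(row) for row in responses]
    order = sorted(range(n), key=lambda i: totals[i])  # lowest total first
    g = max(1, round(n * group_fraction))
    low, high = order[:g], order[-g:]
    results = []
    for j in range(len(responses[0])):
        difficulty = sum(row[j] for row in responses) / n
        p_high = sum(responses[i][j] for i in high) / g
        p_low = sum(responses[i][j] for i in low) / g
        results.append((difficulty, p_high - p_low))
    return results


# Hypothetical pretest: six examinees, four items (1 = right, 0 = wrong).
pretest = [
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
]
analysis = item_analysis(pretest)
```

An item that nearly everyone passes, or one that strong and weak candidates pass equally often, shows up at once in such figures and can be revised or dropped before the test is used.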

Anyone who still regards objective tests as necessarily calling
upon factual memory alone, and not employing higher intellec- 
tual processes than mere recall of isolated scraps of knowledge, 
should personally attempt to cope with a series of new Coopera- 
tive measures, the College Entrance Examination Board’s cur- 
rent achievement battery or that recently developed by the 
Carnegie Foundation as a Graduate (or General) Record
Examination. These are guaranteed to deflate his ego and establish
a wholesome respect for their searching qualities. The writer 
at least has found plenty of room for thought in struggling with 
such tests and prides himself (perhaps unduly) upon reaching 
some correct answers through a process of ratiocination, after 
obsolescent recall had utterly failed. Some space will now be
devoted to certain of these better-known achievement testing
batteries in more detail.


THE COOPERATIVE TEST SERVICE 

The Cooperative Test Service, as its name implies, represents
a project enlisting the cooperation of many outstanding
educators, whether in psychology, personnel methods, mental
measurement, or their respective subject-matter fields. This
program as a whole was initiated by the American Council's
Committee on Measurement and Guidance, which has supervised
its development in general. Professor Ben D. Wood of Columbia
University (who had long striven for the recognition and
technical improvement of reasonably comparable, objectively
scored, achievement measures ranging over the major
educational fields) directed this important undertaking until 1945,
when he resigned to be succeeded by Dr. Kenneth W. Vaughn.

Establishment of the Cooperative Test Service in 1930 
(Hawkes, 1931, p. 35) was specifically made possible by a 
large grant, distributed over a ten-year period, by the Rocke- 
feller Foundation. If one may properly judge succeeding re- 
sults by the degree to which Cooperative Tests have since been 
employed and by the extent of their distribution,⁵ it seems clear
that the considerable investment for this purpose has yielded 
handsome educational returns. 

It is hardly feasible to set forth here any full description of 
the Cooperative Test Service offerings or results obtained there- 
from, because their very magnitude and importance preclude 
summation in specific terms. The Test Service itself has not 
thus far issued a definitive survey of its many accomplishments, 
although a number of separate bulletins and articles thereon 
have been published. Extensive reports are to be found in edu- 
cational literature, reflecting the wide use of Cooperative Tests 
as criteria of scholastic performance (Hawkes, 1931; McConn, 
1933; Wood, 1940). The latest and most authoritative state- 
ment about the use of Cooperative Tests in a guidance program 
is contained in Chapter V, “Evaluation of Achievement in a
Guidance Program,” in Traxler’s recent and important book, 
Techniques of Guidance (1945, pp. 68-97). 


Fields for Which Cooperative Tests Are Available 


The range of Cooperative Test Service materials currently 
available is quite wide. It extends over such areas as English 
(mechanics and effectiveness of expression, reading and literary 
comprehension, vocabulary, literary acquaintance); general
achievement (respective proficiency in social studies, science and
mathematics); contemporary affairs; foreign languages; and
many subdivisions of the broader fields into subject-matter
units designed for successive educational levels.⁶


5. In October, 1942, Mr. David G. Ryans, Executive Secretary of the Co- 
operative Test Service, reported that approximately 30,000 college students in 
170 institutions were tested in that year's College Sophomore Testing Program. 
Even more students (in about the same number of colleges) participated in the 
Freshman Placement Testing Program. (Letter of October 8, 1942, to the 
writer.)

6. For example, in Mathematics alone, the following are listed: Cooperative
Algebra Test, Elementary Algebra Through Quadratics, 40 minutes;
Cooperative Algebra Test, Elementary Algebra Through Quadratics, 90 minutes;
Cooperative Intermediate Algebra Test, Quadratics and Beyond, 40 minutes;
Cooperative Intermediate Algebra Test, Quadratics and Beyond, 90 minutes;
Cooperative Plane Geometry Test, 40 minutes; Cooperative Plane Geometry
Test, 90 minutes; Cooperative Solid Geometry Test, 40 minutes; Cooperative
Solid Geometry Test, 90 minutes; Cooperative Trigonometry Test, 40 minutes;
Cooperative Trigonometry Test, 90 minutes; Cooperative Mathematics Test for
Grades 7, 8 and 9, 80 minutes; Cooperative General Mathematics Test for High
School Classes, 40 minutes; Cooperative General Mathematics Test for High
School Classes, 90 minutes; Cooperative Mathematics Pre-Test for College
Students, 45 minutes; Cooperative College Mathematics Test (for first-year
courses), 40 minutes (Cooperative Test Service, 1945).

These several, extensive materials are described in bulletins
issued from time to time and also in annual catalogues, price
lists, etc., obtainable directly or from the American Council on
Education. The latter body has also organized for some years a
College Sophomore Testing Program (Crissy and Ryans,
1942) and since 1937 a National Freshman Placement
Program (Cooperative Test Service, 1943), both annually
recommended to educational institutions throughout the country and
largely utilizing Cooperative Tests. In addition, there is a
College Chemical Testing Program sponsored by the Committee
on Tests of the American Chemical Society and the College
Physics Testing Program sponsored by the Committee on Tests
of the American Association of Physics Teachers. The
Cooperative General Achievement Tests (I, General Proficiency in
the Field of Social Studies; II, General Proficiency in the Field
of Natural Science; III, General Proficiency in the Field of
Mathematics) suggest an increased interest in the student's
proficiency in dealing with typical materials in a wide field,
rather than knowledge of the particular body of facts included
in a certain textbook or course.

This project as a whole has in one way or another been
responsible alike for the development of improved objective-
testing facilities and for a great expansion of their use.
Moreover, Flanagan's development of Scaled Scores⁷ establishes
notable progress not only in educational tactics but also in the
major strategy of measurement. The strategy represents an
attempt to provide a system (e.g. with “fundamental, intrinsic
meaning,” as set forth in the explanatory bulletin cited below)
of uniform and stable scores for achievement tests at the
secondary school and college levels. Tactics in turn have been
facilitated by development of more meaningful norms, based
upon specifically defined educational groups, than have
previously been constructed.

7. Flanagan, John C., Scaled Scores, Bulletin of the Cooperative Test Service,
New York, December, 1939. (A Sample Individual Profile of Achievement Test
Scores is there reproduced on p. 87.)

Flanagan's procedure analyzes relative basic ability (going
back to earlier performance as measured before curricular dif- 
ferentiation occurs) as reflected by students subsequently elect- 
ing various subjects. He has done much to make the appraisal 
of their respective promise comparable. Cogent remarks on 
validity, reliability indices, norms and other interpretive data 
will also be found in this bulletin (Flanagan, 1939). 


THE PENNSYLVANIA STUDY 


We shall now briefly review one of the most important
objective-testing programs thus far undertaken anywhere on a
large scale and over a period of years: the Carnegie
Foundation's historic Pennsylvania Study (Learned and Wood, 1938).
Although this cannot be adequately discussed here, considera- 
tion of the voluminous bulletin cited will prove stimulating and 
informative on the topic just mentioned and many others per- 
tinent to educational guidance. 

After a necessary period of development and organization, 
the Pennsylvania investigation was formally launched in 1928. 
Both high school and college seniors were given a comprehen- 
sive series of objective achievement tests, twelve hours being re- 
quired for experimental trial (essentially pretesting of ma- 
terials) with the latter group. A reduced program was adminis- 
tered to college sophomores in 1930, and again in 1932 to them 
as seniors. Meanwhile a follow-up study of “progress groups”
was also conducted; e.g., among high school seniors in 1928,
their representatives in college as sophomores two years later,
and eventual senior-class survivors. The composition of this
battery is indicated in the following summary:⁸


Common Subjects
  Intelligence (Otis)⁹                               80 minutes
  English Usage (including Spelling, Grammar,
    Punctuation, Vocabulary and Literature)         120 minutes
  Mathematics                                       140 minutes
  General Culture (including General Science,
    Foreign Literature, Fine Arts, History
    and Social Studies)                             240 minutes

Optional Subjects
  Languages                                          90 minutes
  Social Studies                                     90 minutes
  Sciences                                          120 minutes
  Education                                         180 minutes


8. For a more complete description of these tests see Learned and Wood
(1938, p. 380).

9. Form A was administered to high school seniors in 1928 (op. cit., p. 185).
Form C was administered to college students in 1930 and 1932 (op. cit., p. 219).


One of the most striking general results from this extensive 
study (for which unusually generous time allowances were 
available) was the clear demonstration of widely varying test 
performance in relation to grades-in-course at various colleges. 
This cast grave doubt upon the significance of relative academic 
marks. The findings were not unexpected—indeed, if dissatis- 
faction with grading methods had not already been widespread, 
the investigators would not have undertaken their difficult task. 
However, the results obtained set forth certain problems and 
vagaries of grading in irrefutable, quantitative terms. Accord- 
ing to objective standards of acquired knowledge, students of 
A rank at some institutions were inferior to those of C rank at 
others. Yet even the former were accredited colleges within the
same state. These and associated data as to students’ revealed 
knowledge (or lack of it) in relation to formal course work are 
pertinent to the entire question of reliability in measuring scho- 
lastic achievement. On this, in turn, validation of aptitude tests 
too uncertainly depends. 

Space does not permit much further comment on the Penn- 
sylvania Study: its extensiveness precludes cursory digest. A 
few of the correlations obtained between performance on the 
battery of tests administered and standing in school or college 
are abstracted and summarized in Table 7 and subsequent
discussion.

It is clear that the data just cited represent stability
coefficients; i.e., correspondence between successive administrations
of analogous measures to the same individuals over the intervals 
stated. They are slightly higher than the correlations between 
grade averages for the first two versus the second two years of 
college (.71) as reported below.¹⁰
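
A stability coefficient of the kind just described is a correlation between two sets of scores for the same individuals. The sketch below assumes the usual product-moment (Pearson) coefficient, which the text does not name explicitly; the score lists are hypothetical and are not data from the Pennsylvania Study.

```python
from math import sqrt


def pearson_r(first, second):
    """Product-moment correlation between paired score lists."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two paired lists of at least two scores")
    n = len(first)
    mean_x = sum(first) / n
    mean_y = sum(second) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(first, second))
    sxx = sum((x - mean_x) ** 2 for x in first)
    syy = sum((y - mean_y) ** 2 for y in second)
    if sxx == 0 or syy == 0:
        raise ValueError("scores in one list do not vary")
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for the same six students on two occasions.
earlier = [410, 455, 500, 520, 580, 640]
later = [430, 440, 530, 510, 600, 620]
stability = pearson_r(earlier, later)
```

A value near 1.0 means the students kept much the same relative standing from one administration to the next; a value near zero means the earlier standing tells little about the later one.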

10. For further details as to mean, sigma, etc. see The Student and His
Knowledge (Learned and Wood, 1938, p. 386).


TABLE 7
Certain Correlation Data from the Pennsylvania Study¹¹

Number
of Cases   Variables Compared                                        r

1,187      High school senior total (sigma) test scores and
           the same persons' college sophomore total scores,
           two years later                                          .78

1,187      Mean high school senior (sigma) scores and the
           same persons' college senior sigma scores, four
           years later                                              .79

1,154      Scores on Otis Self-Administering Test, Form A,
           for high school seniors, with corresponding scores
           of the same persons on Otis Test, Form C, in their
           college sophomore year                                   .75

1,154      College sophomore scores on Otis Test, Form C,
           with senior year scores on equivalent Otis Test,
           Form C, two years later                                  .88


Relationship between different sections of the intelligence
and achievement testing battery and the criterion afforded by
college grades is quite a different matter: the evidence regarding
it is conflicting. At one point Learned and Wood (1938,
p. 15) state: “On the whole, the tests yield favorable correlations
with college grades, an average of .63
among 16 colleges studied for conditions in 1928.” No further
data on that situation are offered, but a detailed table, embrac- 
ing 2,800 cases, presents the relationships found between each 
division of the testing battery and grades for the first two and 
the second two years of college respectively. These coefficients 
lligence scores (Otis) 


gh school achievement 
battery. 


Since the Pennsylvania Study attempts to demonstrate the 
superiority of objective measures, and cast 
upon college grades or othe 
would doubtless attribute th. 
the criteria rather than to shortcomings of the test battery. Yet 


“(8) Test Scores and Average College Grades 

“For all but 30 of the 2,830 students in the sophomore-senior 
progress group college grades were available in forms that could 
be coordinated into one scale: A to D, with E fore оа 
and F for failure. This was converted into a numerical scale by


11. Data for this table were obtained from Learned and Wood (1938, pp.
217, 219, 387 and 388).

counting A as 9, A- as 8, B+ as 7, and so forth. E was allowed
a value of 1.

“The grades of a student’s first two years and second two 
years were averaged separately. The first series gave a mean of 
5.07 with a standard deviation of 1.54; the second gave a mean 
of 5.55 with a standard deviation of 1.44. The correlation be- 
tween the college grades of the first two years and second two 
years when averaged in this manner was .71.” (Learned and 
Wood, 1938, p. 386.) 

Throughout this monumental research, one example after 
another is given to illustrate the wide variance of academic 
standards among the colleges studied. Thus to pool superficially 
similar grades as if they were equivalent, after much effort in 
demonstrating that they decidedly are not, is extraordinary, to 
say the least. The correlations referred to above might first have 
been calculated between parts of the test battery and classroom 
grades in each college separately, and then perhaps combined 
into one or more institutional groups by some weighted-average
method. Had this precaution been taken, the coefficients ob- 
tained between test scores and college averages would almost 
certainly have been higher than those resulting from the method 
followed. 
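
The argument above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a reanalysis of the Pennsylvania data: the sixteen hypothetical colleges, the spread of their student-body levels, and the within-college correlation of .60 are all invented for illustration. Each simulated college marks relative to its own students, so that the same grade means different things at different institutions, while the objective test registers absolute level.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def simulate_colleges(n_colleges=16, n_students=200, within_r=0.6, seed=1):
    """Return one (test_scores, grades) pair of lists per hypothetical college.

    Assumption: every college grades relative to its own student body,
    so equal grades are not equivalent across colleges.
    """
    rng = random.Random(seed)
    noise_weight = sqrt(1 - within_r ** 2)
    colleges = []
    for c in range(n_colleges):
        # Hypothetical spread of average student level between colleges.
        college_level = (c - (n_colleges - 1) / 2) / 4
        tests, grades = [], []
        for _ in range(n_students):
            standing = rng.gauss(0, 1)  # standing inside the college
            tests.append(college_level + standing)  # test sees absolute level
            grades.append(within_r * standing + noise_weight * rng.gauss(0, 1))
        colleges.append((tests, grades))
    return colleges


def pooled_and_within(colleges):
    """Pooled r over all students, and the n-weighted mean of within-college r."""
    all_tests = [t for tests, _ in colleges for t in tests]
    all_grades = [g for _, grades in colleges for g in grades]
    pooled = pearson_r(all_tests, all_grades)
    total_n = len(all_tests)
    weighted = sum(len(t) * pearson_r(t, g) for t, g in colleges) / total_n
    return pooled, weighted


if __name__ == "__main__":
    pooled, weighted = pooled_and_within(simulate_colleges())
    print(f"pooled r = {pooled:.2f}; weighted within-college r = {weighted:.2f}")
```

Under these assumptions the pooled coefficient falls noticeably below the weighted average of the separate college coefficients, which is the direction of effect claimed in the text. How large the loss would be in any real set of colleges depends on how far their standards actually differ.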

While no direct evidence can be presented to support this 
claim, r values of only .29 and .26, between Otis test scores re- 
corded in the senior year of high school and grade averages for 
the first two and last two years of college respectively, are much 
lower than have repeatedly been found in like circumstances for 
this or other general intelligence measures. The correlations 
reported have probably been spuriously lowered by improper 
massing of non-equivalent college grades. Hence results of the 
Pennsylvania Study in respect to all subjective criteria (which 
despite their undeniable shortcomings at least deserve fair 
treatment) appear to have been vitiated by abnormal compila- 
tions. The achievement battery described provides a measure of 
each student's relative proficiency in various fields; such data 
cannot legitimately be compared with subsequent average 
marks, indiscriminately assembled from colleges known to have 
quite disparate standards.

Statistical reliabilities of the objective tests utilized in
Pennsylvania were all reasonably high: with but one exception, over
.90 (Learned and Wood, 1938, p. 15). Yet it will be noted
that retest coefficients, obtained at progressive stages of
measurement (e.g., senior year in high school; sophomore and senior
years in college) did not much exceed .70, except for equivalent
forms of the Otis Test of Mental Ability (i.e., scholastic
intelligence) over a two-year interval.¹² The Pennsylvania Study
was conducted under decidedly favorable conditions—with able 
leadership, ample funds and a statistically ideal situation, viz: 
where objective test batteries of high reliability and largely 
similar content could be widely administered under analogous 
conditions at chosen intervals and thereafter comprehensively 
analyzed. Few research workers are in a position to enjoy like 
facilities and experimental advantages. 

The foregoing data indicate that throughout two or more 
years students themselves change sufficiently, in terms of meas- 
ured performance, to place definite limitations upon early ap- 
praisal of their mental capacities or output. We have elsewhere 
repeatedly deplored vagaries of the usual criteria (school or 


themselves do not agree more closely over a two-year period
than is indicated by correlations ranging from .70 to .75, then 


THE GRADUATE RECORD EXAMINATION 


This elaborate instrument of appraisal (colloquially referred 
to as the “GRE”) is of special interest because it has elevated 
measurement of the student and his knowledge from high school 
and collegiate levels, as represented in the Pennsylvania Study 
and other state testing programs, into the comparatively rare- 
fied climate of graduate schools. (Given time and opportunity, 


12. Incidentally, the College Entrance Examination Board’s Scholastic
Aptitude Test, a specialized, “verbal”
measure, yields higher coefficients upon retesting over a like period (College 
Entrance Examination Board Annual Reports: 1929, p. 191; 1931, p. 193; 1939, 


accounted for by a much smaller range of SAT scores among the girls
represented (standard deviations of only 51 to 64 for them, as compared with 83
to 90 for boy “repeaters” and 100 for the entire population tested each year).

the Foundation may even reach that stratosphere where only
Full Professors fly.) It has also been utilized to advantage by 
some institutions as a “General Record Examination” for the 
educational measurement and guidance of college sophomores 
with respect to their most appropriate field of subsequent con- 
centration. While wartime circumstances have naturally im- 
peded its present applicability, initial studies (as partially re- 
ported in special bulletins of the Carnegie Foundation) clearly 
demonstrate its potential value in this respect. 

The examination is in fact a battery comprised of the following
elements:

(a) The General Examination of eight tests taken by all 
students, viz: mathematics, physics, chemistry, biological 
science, social studies (history, government and economics),
literature, fine arts and verbal factor. This latter is a test of 
vocabulary, reading and language comprehension (accuracy of 
discrimination in word meanings). 

(b) Advanced Subject Tests, of which each student takes 
one in the subject of his choice—usually his previous major. 
Advanced Subject Tests are offered in: fine arts, biology, chem- 
istry, economics, engineering, French, geology, German, government,
history, literature, mathematics, philosophy, physics,
psychology and sociology. Still other Advanced Tests are in 
preparation. Extensive revision of the entire battery is under 
way, as reported in late 1945 by Dr. Kenneth W. Vaughn, Director
of this important project.

Whatever changes are made in the nature of items or even of 
fields represented in the general sections comprising this bat- 
tery, its essential nature will doubtless be preserved. This includes
a liberal sampling, through objective-type questions, of
individual performances in the areas mentioned above and re- 
porting thereon by “profiles,” which indicate each participant’s 
relative standing on each section as compared with the total 
population tested. 

The references cited above and the sample GRE report form 
here reproduced illustrate this effective method (referred to 
earlier and more fully demonstrated in the following chapter)
of expressing variations of achievement or aptitude within the
individual. GRE profiles are intended to depict the student's
general knowledge throughout those fields which ideally consti- 
tute the liberal arts and sciences. Unfortunately, as data from 
this examination and other sources demonstrate, that ideal has 
in recent years too often shattered on the rocks of specialization.
As a more powerful measure of proficiency in that narrower
sense, the GRE Advanced Subject Tests raise the standard
of measurement in their particular areas to distinctly
higher levels.

Several publications describing this project, or discussing 
the results thus far obtained, are available upon request to the 
Foundation (Graduate Record Examination, 1941a, 1941b, 
1944; Learned, 1942; Crawford, 1942; Vaughn, 1945).

Major findings thus far clearly indicate: (a) high individual 
reliability of these numerous carefully developed objective 
tests, even at more advanced levels than have previously been 
attempted by these means; (b) validity equal if not superior to
traditional academic records of scholastic performance based
upon much longer observation (about seven hours testing
time¹³ versus four academic years); (c) maximum forecasting
efficiency of promise for graduate study, when this is based 
upon a combination of subjective (academic) and objective 
(GRE) records (Crawford, 1942). The latter conclusion sup- 
ports our earlier argument as to the complementary value of 
traditional and new-type educational measures, taken jointly. 

Without any question, development of this comprehensive 
and well-constructed battery represents one of the major ad- 


GRE in selecting students for advanced professional training. 
In this connection, the broadly representative sampling of 
“knowledge on tap” afforded by its profiles should prove espe- 


w Bulletin (Carnegie Foundation, 
Examination office, describing its 


concepts, generaliza- 


ideas, either recalled or 
supplied for the purpose, to demonstrate power of analysis and skill in ‘fol-


“The modification above described will be found in all the tests except the
Verbal Factor. This test has been enlarged by the addition of a subtest in the
Effectiveness of Expression which deals with the student's practical facility
with words. The single vocabulary of the first Verbal Factor test has been
replaced by three vocabularies corresponding to the three major areas of study 
—scientific, social, and humanistic. 
"Instead of separate tests in physics and chemistry, all the physical sciences 
have been considered as a Single area, offset by a similar combination of bio- 
logical sciences. The four remaining fields—mathematics, social studies, liter- 
ature, and arts—appear under these names as before, but considerably reor- 
ganized for the purpose already explained.”

cially helpful in measuring promise for graduate study among
men returning from the armed forces. 


PLACEMENT TESTING IN IOWA 


We shall now return to somewhat lower educational levels 
and consider achievement testing specifically, in terms of the 
latest Iowa state program. This stems from a project initiated 
more than thirty years ago by Professor Carl E. Seashore at 
the University of Iowa as College Qualifying Examinations. 
Incidentally, Seashore (1897, 1942) was one of the first ex- 
perimenters to stress the term “aptitude.” Even if the tests he 
then devised would hardly be regarded as of that nature today, 
a vast amount of subsequent trial-and-error research leading to
new methods of appraisal must be recognized, and due credit 
given to his early appreciation of the aptitude concept. If Iowa 
has won fame in other fields as “out where the tall corn grows,” 
it has achieved no less recognition in educational circles for out- 
put of the University’s Psychological Laboratory and sturdy 
growth (no less tall) of the Iowa Placement Examinations. 

Other all-state testing procedures could well serve to illus- 
trate modern developments along these lines, which warrant at- 
tention both in their own right as diagnostic media and as a 
background for subsequent evaluation of differential prognos- 
tic tests. However, the present Iowa Battery has been selected 
for this purpose because it contains a series of differential 
measures yielding individual profiles of respective promise for 
(more or less) disparate college-preparatory or subsequent 
major undergraduate fields. The Iowa program combines both 
aptitude and achievement features; certainly it is forward- 
looking in respect to the educational guidance of youth, as well 
as retrospective in terms of measuring past accomplishment.
This it does, however, by comparatively new and original means, 
emphasizing what may be termed functions rather than facts. 


Iowa Testing Service

The Fall Testing Program for Iowa High Schools announced 
in detail by E. F. Lindquist (1942) envisages administration 
throughout four high school and the college freshman years of 
a uniform battery, appropriately stepped up throughout pro- 
gressive levels. It is intended not only to obtain differential 
measures of each pupil once, but also by annual retesting to 
provide indices of his intellectual growth in various directions. 
The underlying philosophy of this important project may best
be exemplified by quoting directly from the bulletin just cited:

“It is almost a truism that the value of an educational program
depends on the nature and extent of the changes that it
produces in the pupils, and accordingly that to evaluate the
program one must measure, not merely the pupils’ educational
status at any given instant, but rather their educational growth
or development during a given time period. . . .

“Aptitude or capacity for growth is much better revealed by
measures of past growth than by measures of present status
alone. . . .

“There is, of course, only one way to measure growth during 
a given period, and that is to measure status in comparable 
terms both at the beginning and at the end of the period. Any 
testing program that is to provide for the measurement of edu- 
cational growth must provide for periodic measurement of edu- 
cational status with the same or with comparable tests.” 

This emphasis on the measurement of growth is analogous to 
that stressed, for example, by Flanagan and Wood with refer- 
ence to prognostic use of the Cooperative Test Service and by 
numerous other educational leaders, who strongly advocate 
cumulative individual records as a measure of such growth. In- 
cidentally, Sir Francis Galton (Walker, 1931, p. 45) was an 
early advocate of what would now be dubbed a “cumulative record
card” for each student.

Lindquist envisages testing on a large scale, as indicated by 
his expectation that from 50,000 to 70,000 Iowa high school 
pupils will participate annually in this program. Presumably 
through state subsidies, the net cost to schools is only twenty- 
five cents per pupil for a total of nine tests requiring at least 
seven hours of working time. He offers the following battery,
grouped into four major categories:


A. General background tests
1. Understanding of Basic Social Concepts 
2. Ability to Do Quantitative Thinking 
3. Ability to Write Correctly 
4. General Proficiency in the Natural Sciences 


B. Special reading tests 
5. Ability to Interpret Reading Materials in the Social 
Studies 
14. The literature on this topic is too extensive for specific citation; earlier


in this chapter to various articles or bulletins by Flanagan, Hawkes, 
үлеп d, et al. repeatedly stress the basic principles involved. 

6. Ability to Interpret Reading Materials in the Natural
Sciences


7. Ability to Read Literary Materials 


C. A Special Test
8. Ability to Use Important Sources of Information 


D. A Special Test 
9. Ability to Recognize Important Word Meanings 


It is readily apparent that Lindquist has functionalized his
approach in terms of certain tools (chiefly verbal) such as reading,
writing and understanding. These can be considered quite
properly as broad aims, or with equal justification as restricted
objectives, depending on how one looks at the matter. It is pertinent
to note, for example, that a particular student's growth
in the special functions measured by Test 1 is to be ascertained
by successive administrations of Test 1 rather than by his
grades in a social science course at college. In other words, he
might conceivably exhibit progressive gain in test scores¹⁵ and
yet make a mediocre college record within the same area. While
that is not likely in fact, the contingency is mentioned here to
emphasize purposive emancipation of this approach from formal
scholastic offerings and standards.

Yet even one who favors and strives for curricular organization
on a functional basis can hardly ignore the realistic,
educational status quo. A college student who hopes eventually to
become an outstanding surgeon must first convince some good
medical school that he is a suitable candidate for admission,
and medical schools have a disconcerting way of scrutinizing
undergraduate records. Moreover, that same student will earlier


15. Unless due allowance is made for educational “growth in general” and
for carry-over from one testing experience to another quite similar in nature,
a certain amount of spurious gain may result as a product of repetition. That
may occur with analogous material even though successive parallel forms are
composed of different items. While this problem will be considered further
in subsequent chapters, it is appropriate to cite here a pertinent announcement
which reports the following effect on candidates’ scores of repeating the
verbal section of the CEEB Scholastic Aptitude Test: “1) On the average, a
candidate raises his score about 50 points when he repeats the test after
approximately one year. 2) The increase is about the same for boys and for girls,
and for independent and public school candidates. 3) As would be expected,
candidates receiving high scores on the first test make smaller gains upon
repeating the test than do low-scoring candidates. 4) The increases in scores are
due largely to growth in the ability measured rather than to ‘practice effect’ of
having taken the test before.” (College Entrance Examination Board, 1943b,
p. 10.)

have faced the problems of admission to college and of successively
meeting its academic requirements in a manner which will
stand up under such scrutiny. Hence the typical constructor 
or employer of achievement and aptitude tests alike faces the 
practical task of measuring probable ability to learn in a given 
situation. He must use those pertinent criteria immediately 
available for judging the value of his test scores; namely, performance
in specific subject-matter courses. Objective measures
of growth, however potentially valuable, unfortunately have 
not yet gained independent recognition as coin of the academic 
realm. 

Lindquist nevertheless states: “Whether from the viewpoint
of guidance or of evaluation, both the construction and the
administration of the tests should be wholly independent of the
present curricular organization and content.” (1942, pp. 10-11.)
It is only in rare circumstances that a test maker can take
a position so aloof from the organization and content of curricula.
This may be accomplished, perhaps with much benefit,
in a state-controlled educational system where admitting and
guidance officers of its University are fully committed to these
new principles of evaluation. Whether these will be accepted
elsewhere as valid educational currency is a question not to be
answered, in a practical or immediate sense, even by assurance
that the principles are sound. It is indeed to be hoped that they
will eventually come to be accepted as superior to the credit
system of gauging a student's merit; but to ignore traditional
standards during the interim period of adjustment would seem
unwise.

Lindquist (idem, pp. 11, 23) also maintains that measures
of past growth are the best indices of present aptitude. However,
this makes no provision for educational guidance of a student
who may contemplate undergraduate work in some new
area with which he has had little if any previous contact. Under
such circumstances, his previous opportunity to exhibit growth
therein may have been decidedly restricted. The basic difference
between a testing program which relies largely upon
achievement measures, and one which utilizes aptitude indices
of a broader sort, lies in relative emphasis upon capacity for
future growth in untried as well as in familiar fields. From the
titles in his battery as outlined above, it is of course clear that
Lindquist’s series does measure certain attributes or tendencies
in a differential sense.


Long-Range Values of the Functional Approach


Lest these comments appear in any sense disparaging to the 
Iowa Battery, it should perhaps be stressed that they are not 
so intended and simply represent interpretive, by no means ad- 
verse, opinions. The Fall Testing Program for Iowa High 
Schools lends itself excellently to this sort of critical analysis, 
for it possesses admirably distinct characteristics. If a really 
thorough trial over a substantial period of time can be accorded 
this functional approach, changes in curricular organization 
at the secondary and college levels may eventually result. At 
present our curricula are highly “subjectized.” A student is of- 
fered the opportunity to take French, chemistry or calculus 
in much the same way that one is urged to take orange juice, 
Vitamin-B tablets or homogenized milk—all widely advertised 
products duly guaranteed as pure and beneficial by their re- 
spective sponsors. Any abilities that may be developed by such 
“taking” seem to be looked upon as direct concomitants of 
learning. 

Yet when the student enters business and civic life, much of 
his subsequent progress may depend upon just such functional 
efficiency as the Iowa tests emphasize; e.g., how well he under- 
stands what he reads; how clearly he expresses ideas in a let- 
ter; how effective he is in conversational dealings with other 
people; or even how quick at simple figuring. Presumably the 
subject-matter knowledge of a specialist is readily evaluated 
only by his fellow workers; yet the well-known man-on-the-street
decides in short order, from his own point of view, how
well the specialist puts knowledge to practical use. If results 
obtained by the Iowa Battery come up to present expectations, 
they should gain wide recognition of corresponding functional 
objectives in curricular organization. 

In contrast to this well-balanced, uniform program, a recent
announcement of the Connecticut Cooperative Testing Program
(University of Connecticut, 1945) lists 150 tests apparently
available as individual schools or pupils may elect. The
announcement implies that although most of these tests are well
known, many are not directly comparable with others in either
level of difficulty, purpose or form of scoring. With such variety
of measures, the interpretation of results is likely to prove a
complicated problem indeed. (Cf. p. 100.)




GENERAL EDUCATIONAL DEVELOPMENT TESTS 
OF THE U. S. ARMED FORCES INSTITUTE 


Anyone familiar with the tests of General Educational De- 
velopment produced by the United States Armed Forces Insti- 
tute will recognize marked similarity to Part B of the Iowa 
program. The USAFI tests are published by the American
Council on Education. They are described as follows in the 
Examiner’s Manual: 

“Two separate batteries of the Tests of General Educational 
Development have been constructed, one for use at the high 
school and the other at the level of the first two years of college. 
The high-school-level tests are intended to be used primarily to 
determine whether or not the individual has had the equivalent 
of a general high school education, or should be granted a high 
school diploma. The college-level tests are intended for use 
primarily to determine whether or not the individual tested is 
as capable of carrying on advanced college work as the student 
who has taken certain broad introductory or survey courses 
generally offered in the first two years of the liberal arts col- 
lege, or has reached the same level of general educational devel- 
opment as the student who has had such survey courses. The 
high-school-level battery consists of five comprehensive exami- 
nations concerned respectively with English composition, the 
social studies, the natural sciences, literature, and mathematics. 
The college-level battery is similarly organized, except for the 
omission of a comprehensive examination in mathematics; special
examinations corresponding to various college courses in
mathematics are provided instead of a comprehensive examination
at this level.” (United States Armed Forces Institute,
1944, pp. 3-4.)

Except for special examinations in mathematics, the college- 
level tests bear these titles: Test One: Correctness and Effec- 
tiveness of Expression; Test Two: Interpretation of Reading 
Materials in the Social Studies; Test Three: Interpretation of 
Reading Materials in the Natural Sciences; Test Four: Inter- 
pretation of Literary Materials. Unfortunately, confidential 
evidence has recently come to light that the GED tests have not 
always been administered under standardized conditions and 
that candidates have sometimes been “helped” to obtain high 
scores. If true, this is certainly a misguided type of aid to the 
individual student. This whole question of protecting standard- 
ized tests from misuse is discussed on pp. 222 ff. Cf. also Craw- 
ford and Burnham (1944) on trial of these tests at Yale.


MINNESOTA STUDIES IN PREDICTING SCHOLASTIC
ACHIEVEMENT 


Extensive research upon achievement and general college 
aptitude tests is reported in the series entitled “University of 
Minnesota Studies in Predicting Scholastic Achievement” 
(1942). Well-known tests were used in the particular investi- 
gation concerned with 827 students in the College of Science, 
Literature and the Arts. To quote, these are: 

“American Council on Education Psychological Examination 
for College Freshmen, Form 1935; Ohio State University Psy- 
chological Test, Form 18; Minnesota College Aptitude Test, 
Form 1926; Cooperative English Test, Form 1935, Series 2: 
1933 equivalent scores, total score on Usage, Spelling, and 
Vocabulary; Cooperative Contemporary Affairs Test, Form 
198%; Cooperative General Science Test (High School), Form 
1934; Cooperative General Mathematics Test for High School 
Classes, Form 1934; Cooperative World History Test, Pro- 
visional Form 1934; Cooperative Literary Acquaintance Test, 
Form 1934.” (Univ. of Minnesota Studies, 1942, p. 18.) 

Without attempting to deal with this study in detail, the 
findings may be described briefly as follows: 

(a) High school percentile rank and ACPE scores yielded 
correlations with general criteria (first-year and two-year honor 
point ratios in all subjects) averaging somewhat over .50 sepa- 
rately. Multiple correlation for the two was .63. Corresponding 
multiples for the Ohio State Psychological Examination and 
the Minnesota College Aptitude Test, each also teamed with 
high school rank, were .61 and .62 respectively. In other words, 
all three of the prognostic tests contributed about equally to 
the evidence afforded by secondary school rank alone. No data 
as to ultimate yield of the entire battery are given. 

(b) High school percentile rank correlated better with 
honor point ratios in most of the specific courses than did either 
general aptitude or particular achievement tests. Correlations 
obtained between the ostensibly most similar predictive and cri- 
terion variables were all under .50, viz: Cooperative Literary 
Acquaintance Test with English literature honor point ratio, 
.49; Contemporary Affairs and World History Tests with
analogous performance ratings in the social sciences—history, 
economics, political science and sociology—from .34 to .44; the 
General Mathematics Achievement Test, .47 with grades in 
mathematics and .42 at best with those in physical sciences; the
General Science Test, .43 with honor point ratio in biological
sciences, etc. 
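
The multiple correlations quoted under (a) follow the usual two-predictor formula, R² = (r₁² + r₂² − 2·r₁·r₂·r₁₂) / (1 − r₁₂²), where r₁ and r₂ are the separate correlations with the criterion and r₁₂ is the correlation between the two predictors. The sketch below is illustrative only. The study's actual intercorrelation between high school rank and the aptitude test is not given here, so the values of r₁₂ tried are hypothetical, and the separate validities are rounded to .50.

```python
from math import sqrt


def multiple_r(r_y1, r_y2, r_12):
    """Multiple correlation of a criterion with two predictors.

    r_y1, r_y2: correlation of each predictor with the criterion.
    r_12: correlation between the two predictors (hypothetical here,
          since the text above does not report it).
    """
    if abs(r_12) >= 1:
        raise ValueError("r_12 must lie strictly between -1 and 1")
    r_squared = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
    return sqrt(r_squared)


if __name__ == "__main__":
    # Separate validities of about .50 each, as in finding (a).
    for r_12 in (0.0, 0.26, 0.5, 0.8):
        print(f"r_12 = {r_12:.2f} -> R = {multiple_r(0.50, 0.50, r_12):.2f}")
```

With two predictors each correlating .50 with the criterion, a multiple of about .63 would correspond to a predictor intercorrelation near .26. The more the two predictors overlap, the less the second adds to the first.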

The data reported are what one might expect for high school 
ranking and scores on tests of the general scholastic aptitude 
type. It is rather surprising, however, to find no higher correlations
between test scores of specific earlier achievement in differential
fields and appropriate criteria in correspondingly specific
freshman courses. Two influences restricting the magnitude of 
all such correlations in this total situation at the University of 
Minnesota, as at many other institutions, are: 

(1) A marking system with only five intervals (A, B, C,
D, F) which restricts the range of possible grades assigned.
Because of this restriction in range there tends to be what is
known statistically as coarseness of grouping. An example
thereof was given in Chapter II in connection with discussion
of percentiles (p. 37). When grouping is by circumstance so
coarse, discrimination among students is necessarily limited.
The grade B may deservedly be assigned to many students,
but in so doing it is impossible to tell which ones are slightly
above C and which just below A. Such a marking system exerts
a depressing effect on the size of correlation with any associated
factors, such as test scores.

(2) The familiar bugaboo again of low criterion reliabilities,
a factor of basic importance to aptitude testing if the latter is
dependent upon such criteria for its own validation. Reasons
for the lack of acceptable reliability are many; varieties in student
performance and subjective factors in grading are perhaps
the most obvious.
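
The first of these influences, coarseness of grouping, can be shown in a small simulation. This is a sketch under invented assumptions, not data from the Minnesota studies: the underlying correlation of .60, the normally distributed scores, and the cutting points between letter grades are all hypothetical. The same simulated achievement is correlated with test scores twice, once as a finely graded measure and once after being reduced to five letter grades.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def letter_grade_points(achievement, cutpoints=(-1.5, -0.5, 0.5, 1.5)):
    """Reduce a continuous achievement value to 0..4 (F, D, C, B, A).

    The cutting points are hypothetical, in standard-deviation units.
    """
    return sum(achievement > cut for cut in cutpoints)


def coarse_grouping_demo(true_r=0.6, n_students=10000, seed=7):
    """Return (r with fine-grained achievement, r with five letter grades)."""
    rng = random.Random(seed)
    noise_weight = sqrt(1 - true_r ** 2)
    tests, fine, coarse = [], [], []
    for _ in range(n_students):
        test = rng.gauss(0, 1)
        achievement = true_r * test + noise_weight * rng.gauss(0, 1)
        tests.append(test)
        fine.append(achievement)
        coarse.append(letter_grade_points(achievement))
    return pearson_r(tests, fine), pearson_r(tests, coarse)


if __name__ == "__main__":
    r_fine, r_coarse = coarse_grouping_demo()
    print(f"fine-grained criterion r = {r_fine:.3f}")
    print(f"five-letter criterion  r = {r_coarse:.3f}")
```

Under these assumptions the five-step criterion yields a somewhat lower coefficient than the fine-grained one, although nothing about the students has changed. The loss grows if marks are bunched into fewer effective categories than the five nominally available.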

Following all too brief a discussion of these two restricting
influences, the Minnesota monograph cited states in part:
“. . . just so long as such an unstable criterion as grades is
used, validity coefficients higher than approximately .72 may
never be attained. Only because no better criterion was available
were honor-point ratios used as the best measure of college
achievement in the present studies.

“However, the possibility of another criterion of college 
achievement was investigated as a subsidiary problem. The 
Sophomore Culture Test was administered to 138 students 
applying for admission to the senior division of the arts col- 
lege.” (University of Minnesota Studies, 1942, p. 27.) This 
comprised the American Council Cooperative General Culture,
English and General Science tests for college students, as earlier
described. For a battery of teamed prognostic measures (high-school
percentile rank, Minnesota College Aptitude, Cooperative
English and Contemporary Affairs tests) the multiple
correlations obtained were .67 with two-year honor point ratio 
and .86 with the Sophomore Culture Test. At first glance the 
latter value seems remarkably high. In this situation, however, 
considerable similarity exists as to nature and content alike of 
the several prognostic and criterion measures; hence the mul- 
tiple coefficient in question appears to have been spuriously 
raised (to a considerable, though indeterminate, extent) 
through noticeable overlapping between the earlier and later
batteries employed.


Conclusions from the Minnesota Studies 

It appears from these extensive studies that neither general
scholastic aptitude tests nor specific achievement measures can
be expected to correlate better than about .45 with college
marks at the Minnesota College of Science, Literature and the
Arts, even in the particular courses for which effective differen-
tial prognosis would seem most likely. If that is true for various
test combinations successively loaded with appropriate indices
of achievement, it would seem to place corresponding limits
upon the validity coefficients obtainable under parallel circum-
stances from more broadly differential aptitude measures. This
point will be recalled when findings with reference to the Yale
Aptitude Battery are presented in the following chapter.

Another interesting conclusion with regard to the relative
values of achievement and aptitude tests is reported: “In short,
the independent contribution of an aptitude test predicted col-
lege scholarship nearly as well as a battery of six achievement
tests. The necessity for economy of time and money would
therefore favor the use of the shorter aptitude test in combi-
nation with high school percentile rank if only one test is to be
used.” (University of Minnesota Studies, 1942, p. 26.)

The aptitude test employed in this and other Minnesota
studies was distinctly of the general type. Hence, implications
of the statement just quoted apparently have little reference
to differential prediction. Yet this is not entirely overlooked by
authors of the Minnesota study, as evidenced by the following
statement appearing in the Foreword (idem, p. iii) written by
Associate Dean T. R. McConnell: “Prediction of success in col-
lege should be put on a differential basis as rapidly as possible.
It is not enough to predict college achievement in general. In
order to individualize students’ programs so that they can make
the greatest use of their particular interests, aptitudes and
previous attainments, it is essential to have some estimate of
their probable achievement in different courses or curriculums.

Although the investigations reported here record progress in
this direction, much more work on differential prediction must
be done.”

The question of how relative unreliability of school and col- 
lege marking systems limits correlation with predictive indices 
has been mentioned before and will be again, for it is an ever- 
present problem. It was noted above that grades in the College 
of Science, Literature and the Arts at the University of Minne- 
sota were expressed on a five-point letter scale (A, B, C, D, F). 
This is quite similar to the scale employed in the undergraduate 
schools of Yale University during the period represented by 
our next chapter, “A Sample Aptitude Battery.” Therefore, 
a parenthetical discussion on this topic seems appropriate here 
because it bears alike upon the immediately preceding discus-
sions and that shortly to follow.

There seems to be little doubt that the correlations for vari- 
ous tests employed in the extensive Minnesota study and for 
aptitude measures in the Yale experiment, with grades-in- 
course at the respective institutions, were alike restricted (at- 
tenuated) by unreliability of those criteria. The point merits 
further consideration. Even though most educators are now 
aware of such shortcomings in subjective grades, this was not 
always the case. Galton gives an interesting account of the 
marking system used in the Mathematical Honours examination 
at Cambridge, shortly before his publication of Hereditary 
Genius: 

“The examination lasts five and a half hours a day for eight 
days. All the answers are carefully marked by the examiners, 
who add up the marks at the end and arrange the candidates in
strict order of merit. The fairness and thoroughness of Cam- 
bridge examinations have never had a breath of suspicion cast 
upon them. 

“Unfortunately for my purposes, the marks are not pub-
lished. They are not even assigned on a uniform system, since
each examiner is permitted to employ his own scale of marks: 
but whatever scale he uses, the results as to proportional merit 
are the same.” (Galton, 1870, p. 18.) 

In the subsequent discussion, Galton dwells much on what he 
calls “the very curious theoretical law of ‘deviation from an 
average.’” It is significant that at no time did he raise the ques-
tion of the reliability of marks or grades. This was left for Pear-
son, Spearman and others to raise at the turn of the century,
when interest in correlation brought this problem to the fore. 
Despite its importance, surprisingly little objective evidence 
has been published about grade reliabilities. Studies at Yale 
(Wolf, 1939, p. 16) reveal that, for the class of 1936, correla-
tions between first and second semester grades in the same
freshman courses ranged from .81 to .89, the median coefficient 
being .84. The marking system in use at that time was a nu- 
merical one with grades reported in intervals of 5 (i.e., 50, 55, 
60, 65, etc.). A somewhat different situation was found when 
data for the class of 1945W were investigated. The marking 
scale in effect when this class were freshmen was a letter sys- 
tem employing the five grades A, B, C, D, F. It is interesting 
to note that the term to term reliability coefficients for this class 
ranged from .59 to .85 with a median coefficient of .73. There 
was thus a reduction of .11 in the median coefficient of reliabil- 
ity when this letter marking system was used, in contrast with 
the situation when a numerical system with an increased number 
of intervals had been employed. 
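
The bearing of these reliability figures upon validity can be made concrete with the classical attenuation relation, under which an observed correlation cannot exceed the square root of the product of the two reliabilities. The sketch below assumes that the term-to-term coefficients quoted above may be treated as reliability estimates for the grade criterion, which is only one of several possible estimates; the function names are illustrative and not taken from any source.

```python
import math


def max_validity(criterion_reliability, test_reliability=1.0):
    """Upper bound on an observed validity coefficient.

    Classical test theory: r_observed <= sqrt(r_test * r_criterion).
    """
    return math.sqrt(criterion_reliability * test_reliability)


def attenuated(true_r, test_reliability, criterion_reliability):
    """Observed correlation expected when both measures are unreliable."""
    return true_r * math.sqrt(test_reliability * criterion_reliability)


# Median term-to-term coefficients reported for the two marking systems.
for label, rel in (("numerical marks", 0.84), ("letter grades", 0.73)):
    print(f"{label}: ceiling on validity = {max_validity(rel):.3f}")
```

On this reckoning a drop in criterion reliability from .84 to .73 lowers the ceiling for even a perfectly reliable test from roughly .92 to roughly .85, and any real test falls short of both.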
Although it may not be entirely appropriate to attribute all 
of this decrease in reliability to coarseness of grouping, never- 
theless that factor is probably a contributing one and cannot be 
ignored. Another factor of probable importance is the disturbed 
conditions surrounding the wartime accelerated program, of 
which the class of 1945W was the first to feel the influence. 
However much we may be able to account for the decreased 
reliability of freshman grades in the class of 1945W, it does 
not alter the picture so far as the problem of securing high 
validity for a differential test battery is concerned. The prob- 
lem is a pressing one. Whatever may be the true meaning of 
these findings, it is certainly evident that freshman grades pro-
vide at best a moving target for the aptitude tester, whose bat-
tery must still be fired without benefit of radar.

The point of this digression is: increased use of objective
achievement tests, such as have been discussed in the present
chapter, should serve to improve and stabilize the criteria of
scholastic progress. We urge their employment in conjunction
with, and supplementary to, subjective judgments since both
types of appraisal seem necessary to full, well-rounded evalua-
tion. Every real gain in that direction would decidedly further
the development of differential aptitude measures, now to be
illustrated by one sample battery.


CHAPTER V 


A SAMPLE APTITUDE BATTERY 


Several examples of important achievement-testing
programs, and some exposition of their nature, have now
been presented. The chapter following this one will con-
sider distinctly more basic media for appraising individual dif-
ferences—e.g., in the sense of “primary mental abilities” or
“unitary traits”—as contrasted with indices of specific knowl-
edge such as have just been discussed. Meanwhile, attention
will be directed toward one organic series of educational apti-
tude tests. This battery occupies a middle ground between
rather narrowly prescribed achievement indices and measures
of intellectual “factors” which (because of their well-nigh uni-
versal applicability throughout numerous mental tasks) are so
broad as to seem nearly flat, for practical guidance purposes.
Chapters IV, V and VI thus successively proceed from quite
particular (achievement) through intermediate (aptitude) to
somewhat theoretical (primary or unitary trait) categories.

Admittedly, neither the idea of a battery nor that of express-
ing separate scores in profile form is novel. In Chapter III,
Thurstone’s use of a profile for reporting scores on the first
edition of his Psychological Examination was mentioned. In
one form or another, batteries galore have for years been uti-
lized in educational measurement. Yet the combination of these
procedures so as to yield individual profiles which are both
prognostic and differential—i.e., reflecting variable readiness-
to-learn in disparate areas of collegiate study—is a compara-
tively recent development. So far as the writer knows, it was
not employed prior to initiation of an experiment along these
lines at Yale University in 1932.

Beginning with that year, successive trial of many tests on
college freshman and secondary school groups has proceeded
continuously, with periodic revision of the materials through
frequent validation studies and item analyses of joint and in-
dividual components. Although isolated measures of aptitude
for one or another field (e.g., music, mathematics, languages
or mechanical drawing) have long been utilized in many places,
the organization of various, disparate indices into a battery
taken alike by all candidates and subsequently reported on a
uniform, comparable scale for all parts represented a new de-
parture. Even now, this procedure in terms of a prognostic test
series is uncommon; the current Primary Mental Abilities
(Thurstone), Iowa (Lindquist), College Entrance Examina-
tion Board and Yale Educational Aptitude batteries, in their
several ways, perhaps best exemplify it. Another recent ex-
ample is afforded by the Armed Forces Institute General Edu-
cational Development battery. Excellent as this is, it provides
no quantitative measures, being largely verbal in nature.

The various instruments progressively selected or modified 
through experimentation from materials originating elsewhere, 
plus those constructed de novo as part of the Yale Educational 
Aptitude testing program, after a decade of research are still 
far from ideal. They nevertheless may serve as examples of one 
approach toward the goal of effective differential forecasting 
through methods of practical usefulness to student counselors. 
While this goal has by no means been won, substantial gains in 
its direction are evidenced by various objective data, later of- 
fered. 

In preceding chapters the claim has repeatedly been made 
that aptitude tests can supplement inventories of formal scho-
lastic achievement by indicating (at least for some areas) a
student’s prospective learning capacity, which his previous 
courses may have tapped but slightly, if at all. Selected, but 
appropriate and fairly characteristic, individual cases will 
shortly be presented to illustrate specific operation of the Yale 
Battery. For some of the most effective testing methods or ele- 
ments employed in this series, indebtedness more extensive than 
subsequent references may fully specify is gratefully acknowl- 
edged. The writer pleads due consciousness of this battery’s 
shortcomings in many respects: utilization thereof as an ex- 
ample of aptitude testing implies nothing as to its own possible 
merits. However, sufficient utility for individual counseling pur- 
poses has already been demonstrated by this group of tests to 
suggest that some analogous type of measurement may soon 
become recognized as contributing markedly to educational 
guidance. If undue attention (not only to immediately follow- 
ing examples of the Yale Battery in action but also throughout 
later chapters) seems directed toward local findings, that is 
primarily because of the writer’s obvious dependence upon 
such data so far as personal observation and reporting are con-
cerned. The individual test profiles which will subsequently ap-
pear are thus drawn from direct experience with the tests illus-
trated and specifically discussed in this chapter.


DEVELOPMENT AND COMPOSITION OF THE 
YALE BATTERY 


After some years of preliminary experimentation with care- 
fully selected freshman test groups, a tentative aptitude bat- 
tery was given to all entering Yale freshmen in the fall of 1938. 
With successive modifications thereafter, based upon yearly 
analysis of the results obtained, a similar set of tests has since 
been regularly administered immediately upon matriculation.¹
With few exceptions, all entrants had earlier taken the College 
Entrance Examination Board’s Scholastic Aptitude Test, 
which was accordingly utilized to appraise verbal facility. 
Other measures were “anchored” to that index; i.e., initial 
scores on the various differential tests were redistributed to 
conform with its mean and standard deviation for all matricu- 
lants. The resulting transmuted indices were therefore readily 
comparable with the basic Verbal Factor scores, and likewise 
with each other, as separately expressed on individual profiles. 
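
The “anchoring” just described can be sketched as a simple rescaling. The example assumes a linear transformation, which is one plausible reading of “redistributed to conform with its mean and standard deviation”; the battery may in practice have used some other conversion. The raw scores shown are invented, while the target mean of 554 and standard deviation of 91 are the SAT figures for the Class of 1944 cited in this chapter.

```python
import statistics


def transmute(raw_scores, target_mean, target_sd):
    """Linearly rescale raw scores to a chosen mean and standard deviation."""
    mean = statistics.fmean(raw_scores)
    sd = statistics.pstdev(raw_scores)
    if sd == 0:
        raise ValueError("raw scores show no variation; cannot rescale")
    return [target_mean + target_sd * (x - mean) / sd for x in raw_scores]


# Hypothetical raw scores on one differential test.
raw = [12, 18, 25, 31, 40]
anchored = transmute(raw, target_mean=554, target_sd=91)
print([round(score) for score in anchored])
```

Once every test has been treated in this way, a given scale value carries the same meaning, relative to the class, on each part of the profile.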

This battery comprised the following elements: 

I. SAT—Verbal Facility (The College Board Scholastic 
Aptitude Test, Verbal Section, taken prior to matriculation and 
subsequently described in Part II). 

II. ALT—Linguistic Aptitude, as measured by an Arti-
ficial Language Test of new design.

III. VRT—Verbal Reasoning (logical inference, deductive
judgment, etc.).

IV. QRT—Quantitative Reasoning (ability in manipulat-
ing hypothetical quantitative data so as to perceive relations or
principles characterizing them and derive “laws” analogous to,
yet different from, those actually encountered in study of the
natural sciences).

V. MAT—Mathematical Aptitude (in some years ap-
praised by College Board materials, such as constitute the
present mathematical section of its Scholastic Aptitude Test;
in others by independently developed measures of similar pur-
pose).

VI. SVT—Spatial Visualizing (representation of three-
dimensional forms by two-dimensional figures through “projec-
tions,” block-counting, etc.).

VII. MIT—Mechanical Ingenuity (problems in gear or
pulley movements, structural stability and mechanical opera-
tions).

1. Owing to the war-accelerated program, matriculants in July, 1942, number-
ing about 1,050 (Class of 1945W), were not thus tested until midterm, in Au-
gust. Because of earlier modification in the College Entrance Board program,
substituting certain objective tests for previous essay examinations, the Yale
Freshman Battery was changed accordingly. However, a “profile” embracing
either six or seven measures (depending upon the matriculant’s option among
the earlier College Board tests) was provided for all entering students in that
class. This procedure has since been extended to include Army and Navy Col-
lege Trainees along with civilian freshmen on a comparable basis.

Sample items employed in these various instruments are il-
lustrated by a Practice Booklet (normally issued to all partici-
pants in advance of their actual testing program) and repro-
duced in Appendix A. This booklet includes representation of
a Verbal Factor Test for use with secondary school students 
or others for whom College Board Scholastic Aptitude Test 
verbal scores are not available. 

Except for two sections (verbal and mathematical) which 
require considerable modification for different levels, this series 
of educational aptitude measures seems to operate with increas- 
ing effectiveness throughout a four-year academic range—i.e., 
from tenth grade to the college freshman year. Appropriate
norms must, however, be utilized at successive educational
stages; indicating again that aptitude, as herein defined, rep-
resents facility in new application of earlier achievement, rather
than a differential capacity so inherent and fundamental as to
be little affected by past experience or formal learning. Impact
of the latter varies considerably among the seven (more or less 
disparate) major areas which this battery attempts to cover. 

For example, previous education exercises maximum influ-
ence upon the verbal and mathematical aptitude measures. Dif-
ferences in the respective means of unconverted (raw) test
scores between tenth-grade and college freshman students, con-
versely, are least for the spatial and mechanical measures; prob-
ably because the functions they represent have not generally
been accorded much chance for development by school cur-
ricula. On the other hand, earlier familiarity with machine tools
or employment as a draftsman may raise some students’ per-
formance on these particular tests above the levels they attain
elsewhere throughout the battery. This has been particularly
noticeable in the case of recent Navy V-12 students transferred
from fleet service, where they frequently had specialized tech-
nical training and experience. A probable explanation of such
results is that working experience offers the best opportunity
for growth of aptitude in nonacademic directions, just as for-
mal curricular requirements at school and college alike have
long afforded particular recognition of promise along tradi-
tional lines.


MAJOR AREAS REPRESENTED 


Returning to specific consideration of this battery, it should
be noted that the seven individual scores thereon may be con-
sidered as having (roughly) the following directional signifi-
cance: I, II and III toward liberal arts study; III, IV and V
toward pure science and mathematics; V, VI and VII toward 
some branch of applied science, such as engineering. It is evi- 
dent that the intermediate (scientific) group of indices over- 
laps academic areas on one side through Test III (Verbal 
Reasoning) and technological fields on the other side through 
Test V (Mathematical Aptitude). This situation is quite rea- 
sonable, since it represents the inevitable overlapping among 
these fields which is apparent from consideration of their re- 
spective curricula. In fact, attempted distinction at these edu- 
cational levels between relative aptitude for pure or applied 
sciences might be regarded as of somewhat questionable sig-
nificance. However, according to evidence thus far obtained,
comparisons of performance on the first two or three tests in 
this series versus the last three or four show a rather high de- 
gree of validity. 
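
The overlapping directional groups named above can be written out as a small table and applied to a profile. In the sketch below the grouping of tests follows the text, but averaging the three scores in each group is an assumption made for illustration only, and the profile values are hypothetical rather than those of any actual student.

```python
from statistics import fmean

# Test abbreviations follow the battery listing; groupings follow the text
# (I-III liberal arts, III-V pure science, V-VII applied science).
AREAS = {
    "liberal arts": ("SAT", "ALT", "VRT"),
    "pure science and mathematics": ("VRT", "QRT", "MAT"),
    "applied science": ("MAT", "SVT", "MIT"),
}


def area_means(profile):
    """Average the three anchored scores belonging to each area."""
    return {
        area: fmean(profile[test] for test in tests)
        for area, tests in AREAS.items()
    }


# Hypothetical profile on the scale with mean 55 and sigma 9.
hypothetical_profile = {
    "SAT": 70, "ALT": 68, "VRT": 69, "QRT": 60,
    "MAT": 57, "SVT": 45, "MIT": 44,
}

for area, mean in area_means(hypothetical_profile).items():
    print(f"{area}: {mean:.1f}")
```

Because Tests III and V each enter two groups, the three averages are not independent, which reflects the overlapping among fields noted in the text.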

The intermediate, two-sided zone has proved specifically dis-
criminating for a considerable number of individuals, even
though group contrasts are less clear-cut than between aca-
demic and engineering areas. The situation in this respect has
long been complicated at Yale because science majors were
available both in Yale College (the established Academic De-
partment) and, with a higher degree of concentration, in the
Sheffield Scientific School. The latter, for years past, has
also offered majors in applied economic science and industrial
administration which, conversely, were not true science majors.
The effect is greatly to confuse the criteria for pre-science
freshman groups throughout subsequent discussions.

FORM AND NATURE OF EDUCATIONAL APTITUDE
PROFILES 


Before sample results illustrating the performance on this
battery of either individuals or groups (respectively headed
toward one or another baccalaureate degree) are presented, it
is necessary to consider the method of reporting scores. For this
purpose, Yale freshman norms (Class of 1944) will be used,
as representing the latest group not educationally affected by 
near prospects of war service. Some of the individuals repre- 
sented below were actually enrolled in other classes, but their 
profiles are represented on the same form, since successive class 
means and standard deviations remained practically un- 
changed throughout those years. 

Figure IV represents the actual graph of test scores made 
by a student whom we shall designate by the familiar appella- 
tion “John Doe,” evidently a most promising candidate for the 
B.A. degree, but weak in spatial visualizing and mechanical 
aptitude. The solid horizontal line indicates the class mean of 
554 (here expressed for purposes of simplification as 55) on 
the College Entrance Examination Board's Scholastic Apti- 
tude Test (Verbal Section) to which all other elements in this 
series are keyed. Dotted lines represent the interquartile range 
and (near top and bottom respectively) the 90th and 10th per- 
centile levels. 


The report form, shown in full, is intended to be self-explan-
atory. Separate test results are plotted as standard scores (i.e.,
standard deviation intervals from the class average) indicated
on the left-hand vertical scale, and corresponding percentile
ranks on the right-hand scale. These are also keyed to the known
standard deviation (sigma) of SAT scores for the entire fresh-
man class, actually 91 but here reduced to 9. Comments de-
scribing the profile are intended for the benefit of student coun-
selors.² These comments, and “John Doe’s” profile, follow.


EXPLANATION OF THE PROFILE GRAPH (Figure IV): On the
regular reporting scale of the College Entrance Examination Board, a score of
500 (here expressed for convenience as 50) represents the mean or average of
the entire pre-college group tested. For the somewhat more highly selected
group of students admitted to the Yale Class of 1944, the mean score on the
College Board Scholastic Aptitude Test (of verbal comprehension) is 55. In
order to express scores on all of the aptitude measures on a comparable basis,
this score of 55 has been adopted as the general mean, since it best represents
the general level of ability for this freshman class. The scores obtained by each
student are written below the profile lines representing the tests and these
scores have been plotted on the graph. The test score scale, with the average at
55 and ranging from somewhat under 30 to over 80, appears at the left side of
the graph. The percentile equivalents are indicated at the right.

2. It has been suggested by Professor Toops that the profile sheet here illus-
trated should also indicate the standard error of measurement for each test.
This is a function of the sigma score employed to express individual differences
from the mean, and of the test reliability (cf. p. 57 of Chapter II on “Statisti-
cal Methods”). In this battery, all original raw scores are transmuted to a
mean of 55 and a sigma of 9, as explained above. Reliabilities, as determined
by the split-half method with Spearman-Brown correction (cf. Chapter II, pp.
54-55) have been about .95 for each of these tests. Hence the standard error
of measurement for each is less than 2; i.e., the chances are roughly two to
one that an individual’s true test performance would not deviate more than
(±) 2 standard score points from that plotted for each section of his profile.
This should not be taken to mean, however, that other and more gross errors
extraneous to the test itself (such as fluctuations in effort or interest on the
part of individual subjects, for example) might not to an indeterminate
degree affect the accuracy of obtained scores.
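
The standard error of measurement mentioned in the footnote on Professor Toops's suggestion can be computed from the usual formula, sigma times the square root of one minus the reliability. The sketch below applies that formula to the reporting scale of this battery (sigma of 9, reliabilities of about .95); the function name is illustrative only.

```python
import math


def standard_error_of_measurement(sigma, reliability):
    """SEM = sigma * sqrt(1 - reliability), per classical test theory."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sigma * math.sqrt(1.0 - reliability)


sem = standard_error_of_measurement(sigma=9, reliability=0.95)
print(f"standard error of measurement: about {sem:.2f} score points")
```

With the reliability taken as exactly .95 the result is very nearly 2 score points, and reliabilities slightly above .95 bring it just under 2. The odds of roughly two to one correspond to a band of one standard error on either side of the plotted score.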

The percentile scale gives the standing of each individual score in terms of 
the per cent of all scores which it excels. For example, if a person obtains a 
score of 67, he is at the 90th percentile, i.e., he has excelled 89 per cent of the 
group with which he was competing on this particular test, and 10 per cent of 
the group have made equal or still higher scores. From the chart it is possible 
to determine in which tenth or quarter of the distribution a given score be- 
longs. The dotted horizontal lines indicate the quartile points and, nearer the 
extremes, highest and lowest deciles. 
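
The correspondence between scale scores and percentile ranks can be approximated as follows. The sketch assumes that anchored scores are normally distributed about a mean of 55 with a sigma of 9; the printed profile form may instead have rested on the observed class distribution, so the figures are illustrative only.

```python
from statistics import NormalDist

# Assumed reporting scale: mean 55, sigma 9, normal in form.
SCALE = NormalDist(mu=55, sigma=9)


def percentile_rank(score):
    """Approximate per cent of the class falling below a given scale score."""
    return 100.0 * SCALE.cdf(score)


for score in (43, 49, 55, 61, 67):
    print(f"score {score}: about the {percentile_rank(score):.0f}th percentile")
```

Under this assumption a score of 67 falls near the 90th percentile and scores of 49 and 61 near the 25th and 75th, in agreement with the values given in the explanation of the profile graph.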

One-half of the scores (25th to 75th percentiles) fall between 49 and 61. This 
means that scores above and below these points are more significant as indica-
tions of aptitude, or lack of it, than are scores falling within these limits, i.e.,
which tend towards the average. There are no passing or failing “marks” on
these tests. The scores which are given simply rank each individual relatively
within the entire freshman group.

In the interpretation of results, it should be borne in mind that individual 
differences, as revealed by tests of this type, are of two sorts. One reflects vari- 
ations among students comprising the group as a whole—i.e., relative rank; the 
other, differences in relative capacity which may occur within the same indi- 
vidual. When individual variances of the latter type are large—around 10 
points or more on the scale—they are probably worthy of note for purposes of
educational guidance. Variances of around 20 points or more in performance
by the same individual on the different tests are almost certainly significant in
this respect. Thus, the aptitude measures are intended to serve a double pur-
pose—first, to indicate the capacity of each student, as compared with his fel-
lows, for the different types of thinking emphasized by the curricula of our
respective undergraduate schools; and second, to reveal relative promise within 
the individual for these different areas of upper-class work. 
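
The rule of thumb for differences within one individual's profile can be sketched as a simple screening routine. The thresholds of 10 and 20 points are those named in the text, where they are given only as approximate; the category labels, the function names and the sample scores are assumptions made for illustration.

```python
from itertools import combinations


def classify_difference(points, notable=10, significant=20):
    """Label a within-individual difference between two anchored scores."""
    if points >= significant:
        return "almost certainly significant"
    if points >= notable:
        return "probably noteworthy"
    return "unremarkable"


def notable_contrasts(profile):
    """List every pair of tests whose scores differ by 10 points or more."""
    results = []
    for (test_a, score_a), (test_b, score_b) in combinations(profile.items(), 2):
        diff = abs(score_a - score_b)
        label = classify_difference(diff)
        if label != "unremarkable":
            results.append((test_a, test_b, diff, label))
    return results


# Hypothetical scores for three of the seven tests.
sample = {"SAT": 70, "QRT": 60, "MIT": 44}
for test_a, test_b, diff, label in notable_contrasts(sample):
    print(f"{test_a} vs {test_b}: {diff} points, {label}")
```

A counselor would of course weigh such contrasts against the standard error of each score rather than treat the cut-offs as exact.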

Whether a student ranks comparatively high, average or low on the tests as
a whole is an indication of his ability in general; while, at each of these levels,
considerable variation from one test to another will suggest the areas in which
he is likely to do better or poorer, relatively to his average level. Wide differ-
ences of this sort and degree are the exception rather than the rule; but even
though students reflecting them are in the minority, it should be recognized
that the individuals who do reflect them are the very ones for whom educa-
tional guidance is most important.
The tests are arranged so that the three indices most indicative of academic
aptitude are at the left. Relatively high scores in the verbal and linguistic tests
would naturally point towards selecting, as a field of upper-class concentration,
subjects like English literature, history, languages, or the social sciences. Con-
versely, the three tests at the right are those most significant for the study of
Engineering. The middle section, which overlaps to some extent each of the
other two, may be regarded as pointing towards mathematical or scientific work
rather than towards either academic or engineering studies. These
three sections may roughly be taken as representing, from left to right,
relative promise for “academic,” scientific or engineering fields of concentration.
PROFILE GRAPH SHOWING INDIVIDUAL PERFORMANCE
ON EDUCATIONAL APTITUDE TESTS

Interpretation of Personal Scores


Portraying an individual’s test profile in this manner makes 
it possible for his counselor to evaluate relative promise for 
various upper-class majors which the student may be consider- 
ing. Thus John Doe ranks within the top 5% or 10% of his 
class on all three verbal-linguistic tests and nearly at the 70th 
percentile in quantitative reasoning. While less outstanding in 
mathematics, he should be competent to handle that subject, 
since his aptitude therefor is slightly above the class average. 
On the other hand, he is likely to find mechanical drawing and 
engineering troublesome. He would certainly not be making the 
most fruitful use of his undoubted talents by pursuing techno-
logical studies. Of course John need not concentrate in English
or foreign languages just because his top scores lie in that gen- 
eral area; if more interested in government, history, economics 
or philosophy, for example, he should be able (from this evi- 
dence) to attain honors therein with requisite application. 

To carry this illustrative case further, let us suppose that
John’s first term grades are: English B, history C, Spanish D,
mathematics C, chemistry D. He is then obviously not perform-
ing at all up to scholastic capacity. Why not, of course, the
data fail to reveal; but the fact remains. A student might,
through lack of interest or physical malaise during the test ses-
sions, fail to do justice to his aptitudes as measured by such a
battery; that is, he could make a decidedly poorer test record
than properly represents his real capacities. But he cannot
make a substantially better one. This particular
student is a very able lad, especially in certain fields. The con-
trast between his profile and the grades postulated above would
give a dean or counselor definite, objective proof of the stu-
dent's academic lethargy. John may prefer to exert indubitable
other lines—“heelin 


to regard the pursuit of learning as sec- 
ondary to other goals not, in the face of S dh ati- 
0 Superior classroom work; and his 
€ it clear that he isn't being fooled a 
ms of earnest endeavor. 
Group Aptitude Profiles


The succeeding graph indicates relative test performance of
two contrasted freshman groups. As just explained, these are
ranged (from left to right) in order of differential aptitude
from verbal (academic) through scientific to engineering
branches of the undergraduate curricula. Figure V shows the
average profiles, on each of these seven measures separately, of
all freshmen in the Class of 1944 whose tentative prematricu-
lation aims were to pursue upper-class work in (a) Yale College
or (b) the School of Engineering, respectively.

While the profile differences in Figure V may not at first
glance appear great, their significance becomes more apparent
when one realizes that they represent group rather than individ-
ual variations. The prospective liberal arts freshmen rank on
the average about one-third of a standard deviation above the
pre-engineering students in verbal factor (Test I), while en-
gineering candidates are superior to a greater degree (by .7 to
.9 sigma)³ on mathematical and spatial aptitude measures (V
and VI). Thus the sum of observed differences, even for these
pooled data, is quite marked—roughly equivalent to 85 per-
centile points on the scale under consideration.

The contrast in relative aptitudes between pre-engineering
and prospective academic students is therefore rather clearly
revealed, at least among Yale freshmen. Analogous graphs were
also prepared for entrants planning later to elect majors in the
Sheffield Scientific School, and for matriculants undecided at
the time regarding their upper-class programs. These group 
profiles (not reproduced here) show lesser variation, that for 
the undecided group being practically a straight line along the 
class mean. 
To what extent particular emphases in preparatory training
have produced or accentuated these differences is indeterminate.
However, secondary school and entrance examination pro-
cedures underlying the process of selective admission to certain
colleges have usually placed far more emphasis upon common
requirements for all students in the basic school subjects than
upon individual differences in relative promise for quite dis-
parate upper-class concentration areas. It therefore appears
that aptitude tests of the type here exemplified reveal such
differences (for groups or individuals) more clearly than pre-
matriculation indices customarily utilized in the selection of
entrants have done in past years, for technological studies.

3. E.g., a difference of 8.4 scale points (on the Spatial Test) between the
means of the Yale College and Engineering School students is about .9 of the
standard deviation (9) in terms of which all scores are reported.

PROFILE GRAPH SHOWING GROUP PERFORMANCE ON
EDUCATIONAL APTITUDE TESTS
(Class of 1944)


FURTHER TEST PROFILES 
This point becomes more strikingly apparent when individual,
rather than group, profiles are considered. The succeeding
illustrative cases, all real, not hypothetical, were originally
chosen to exemplify certain aptitude-test patterns (both differ- 
ential and “flat”) when these students were still freshmen. They 
have since graduated, and the following comments will there- 
fore include data as to subsequent majors elected, scholastic 
records and other pertinent information. It should be empha- 
sized that not one of the cases was selected ex post facto, in 
terms of later decisions or accomplishments. 

The graphs which immediately follow represent the aptitude 
test performance of three students who had indicated, before 
matriculation, that they would pursue a liberal arts course. All 
of these ranked practically alike (at the mean) in verbal and 
language facility; but the first, W.G. (Figure VI), retained approximately
that same level throughout. The second, T.F.
(Figure VII), was distinctly outstanding in the scientific area;
while the third, R.B. (Figure VIII), broke sharply downward
from average on the left and superior on both reasoning tests 
to low on the right. 

Traditional entrance data, largely based upon verbal and
language subjects which dominate the secondary curriculum,
would be unlikely to reveal such differences in relative educational
aptitudes. General scholastic predictions of these three
students and their freshman averages (both noted above each
profile) were respectively in close agreement and all within the
middle range (70 to 78).


Later concentration areas and scholastic records of these
same students are interesting. W.G., for example, whose profile
showed little deviation from the mean on any test, appropriately
took the middle way: applied economics, which is a functionally
wide, rather undifferentiated major field. It is significant
that all his individual course grades likewise were between
70 and 80 (C or B). This total pattern, from all aspects, is
notably consistent.


4. Under [...] conditions, College Board tests of the objective (aptitude
and [...]) [...] seemingly offer better and more directly comparable
discrimination in these respects than their precursors did.

PROFILE GRAPH SHOWING INDIVIDUAL PERFORMANCE
ON EDUCATIONAL APTITUDE TESTS
(T.F., Class of 1943)

T.F., despite a profile distinctly high in scientific aptitudes,
elected the more traditional economics major in Yale College. 
His scholastic record there was little better than average. It is 
significant that after receiving his B.A. degree, T.F. entered 
the Sheffield Scientific School and completed one more under- 
graduate year, with honors, in the Department of Industrial 
Administration and Engineering. Perhaps that is where he be- 
longed in the first place. 

The third member of this trio majored in architecture as a 
Yale College student. This choice was hardly consistent with 
his test profile and low mechanical ingenuity score. His subsequent
scholastic record was well below average; in senior year,
when three of his courses were in the major field, R.B.’s percentile
rank was only 12. Here is a case which might be attributed
either to serious failure in counseling or to stubborn resistance 
against it. Whatever the cause, R.B. was an educational misfit 
through circumstances which stacked the cards against him. 
Either his family or Yale (probably both) “let him down.”

Compare with the foregoing another profile from the an- 
nounced “preacademic” group; that of E.T. (Figure IX). 
Quite low in verbal and language facility but relatively high in 
scientific promise, this entrant likewise was distinctly out of 
place in his originally intended course. However, on the basis 
of test evidence, E.T. subsequently elected industrial administration
and graduated with a commendable record from the
Scientific School.


Basic aptitude testing objectives may be further illustrated
by three graphs of freshmen whose choice of an upper school
was, at entrance, undecided. The first, R.C. (Figure X), is
average in verbal facility but distinctly high in the quantitative
and mechanical areas. The second, E.L. (Figure XI), is equally
[...] within himself in these respects, but on a much lower
[...] throughout. Both appear from their test profiles as likely
to do less well in academic than in technological studies; although
high positive aptitude for the latter characterizes the
first case, while a low verbal score in the second leads to
analogous conclusions on negative grounds. R.C. made a wise
[...] mathematics as his [...]
[...]tion in outside activities but fell below the class average scholastically.

PROFILE GRAPH SHOWING INDIVIDUAL PERFORMANCE
ON EDUCATIONAL APTITUDE TESTS

The third member of this group, J.J. (Figure XII), repre- 
sents a student ranking well above the mean on all seven apti- 
tude measures, and within the top tenth on five of them. His 
spatial and mechanical scores are excellent; but so are those on 
the verbal side as well. That he experienced much difficulty in 
deciding upon a field of concentration was in fact due to his uniform
superiority; the unusual person with so many talents may
raise difficult problems in counseling for the very reason that he 
can do anything. J.J. finally majored in history, with moderate 
distinction. He strikingly represents the American prerogative 
mentioned earlier, of really concentrating on extracurriculum 
activities if one chooses to do so. In these, J.J. swept the board. 
For the others in this trio, scholastic indecision could be more 
easily resolved, once measures not previously available had 
clearly indicated variance within themselves of relative scholas- 
tic promise. 

Another multi-aptitude profile follows, for A.G. (Figure 
XIII). This again represents distinctly high scores through- 
out, ranging from over 1 to 3 standard deviations above the 
class mean. Although A.G. would probably have succeeded in 
any field of concentration, his announced choice of a scientific 
course was most appropriate to particularly outstanding rank 
in quantitative reasoning and mathematical aptitude. Inciden- 
tally, his subsequent performance in mathematics and science 
was remarkable. Even in sophomore year, he received A
marks in two graduate mathematical seminars; and thereafter
all his work (except for one beginning course in French) was
[...] the Graduate School, with exceptional distinction.
[...] in Naval Intelligence.

[...] specific illustrations are intended to suggest
[...] type may indicate differential promise by
appraising something more than, or complementary to, specific
[...] in fields already studied. Achievement tests, as [...]
to other instruments when:
(a) [...] through formal education [...];
(b) [...] they [...] to the particular sub[...] when prognosis of sequen[...]
the same (or a closely related) field is sought.
Educational aptitude tests undertake the
no less important task of supplying analogous prognosis when
achievement measures (because of not satisfying either of the
two conditions just cited) are inappropriate. This they attempt
by means of the miniature-test or novel-application techniques
earlier discussed.⁵

PROFILE GRAPH SHOWING INDIVIDUAL PERFORMANCE
ON EDUCATIONAL APTITUDE TESTS

Progress to their objective relies upon indirect methods,
gauging relative educability in this or that area through a sampling
of how readily past experience is applied to the solution of
new problems. Each of the aptitude measures comprising the
Yale Battery is intended (through making demands respectively
typical of disparate higher-learning⁶ areas) to measure
facility in transfer of previous learning to a series of varied
and unfamiliar situations. Since such tests thus proceed by indirection,
the measures they yield are at best only approximations,
or clues as to future performance. Yet these clues in many cases
prove superior for guidance purposes to more firmly established,
but sometimes less individually relevant, data of the
more customary type.


TOTAL AND SEPARATE ASPECTS OF THE 
YALE BATTERY 


Separate elements of this series, and analogous tests devel- 
oped elsewhere, will be individually considered later (Part II) 
with reference to particular areas of study: i.e., verbal and 
linguistic measures in connection with the liberal arts; reason- 
ing and quantitative thinking indices in parallel connection 
with the natural sciences; mathematical, spatial visualizing and
mechanical ingenuity tests with technological courses, etc. 
These factors overlap in some aspects but stand out as relatively 
independent in others. 

It seems appropriate to introduce here a sample of intercorrelations
among the various tests constituting the Yale Battery.
These data (comprising Table 8, below) represent scores
made by members of the Yale Class of 1944 who took the tests
at matriculation. Although these coefficients reveal nothing with
respect to validity of the various tests, they do indicate the
extent to which they are interrelated or conversely independent.
For example, Tests I and III correlate .64. This is reasonable,
since they are both composed exclusively of verbal materials.
In an analogous manner the .62 correlation between Tests IV
and V is consistent with the quantitative reasoning content of
these two tests. Conversely the .19 correlation indicates relative
independence of Test I scores from those of Test VI.

5. Chapter I (pp. 6-7).
6. Chancellor Robert M. Hutchins, [...] exceptional intellect in the highest sense, he
[...] a twentieth-century model of the [...]. (The Higher
Learning in America, Hutchins, 1936.)


TABLE 8 


Intercorrelations Among Aptitude Tests Administered to 
Yale Freshman Class of 1944* 


                          Tests

               II     III    IV     V      VI     VII
Tests          ALT    VRT    QRT    MAT    SVT    MIT
SAT  I         .41    .64    .30    .20    .19    [?]
ALT  II               [?]    .49    .32    [?]    .32
VRT  III                     .51    .32    .38    .44
QRT  IV                             .62    .51    .61
MAT  V                                     .49    .50
SVT  VI                                           .55

* N = 856 except in correlations involving Test I, in which case N = 837.


Test abbreviations are: 


I SAT: Scholastic Aptitude Test (Verbal)
II ALT: Artificial Language Test
III VRT: Verbal Reasoning Test
IV QRT: Quantitative Reasoning Test
V MAT: Mathematical Aptitude Test
VI SVT: Spatial Visualizing Test
VII MIT: Mechanical Ingenuity Test
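
For readers wishing to assemble a table of this kind from their own records, the following is a minimal sketch of how the upper triangle of an intercorrelation table may be computed with the ordinary product-moment formula. The scores shown are invented for illustration and are not Yale data; the function names are likewise hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def intercorrelations(scores):
    """Upper triangle of the intercorrelation table, keyed by test pair."""
    names = list(scores)
    return {
        (a, b): round(pearson_r(scores[a], scores[b]), 2)
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Invented scores for six students on three tests (illustration only).
scores = {
    "SAT": [52, 61, 47, 58, 66, 43],
    "VRT": [50, 63, 45, 55, 68, 48],
    "SVT": [60, 44, 52, 49, 57, 55],
}

table = intercorrelations(scores)
for (a, b), r in table.items():
    print(f"{a} with {b}: {r:+.2f}")
```

Only the pairs above the diagonal are stored, since the correlation of Test A with Test B equals that of Test B with Test A.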


Corresponding data based on other Yale freshman classes
show closely similar relationships; the pattern for that level
consequently seems quite stable. Intercorrelations derived from
1,966 secondary school students to whom this battery was administered
in Grades 10 and 11 form an analogous pattern, except
that these secondary school correlations tend to run a bit higher
than college freshman coefficients. This is particularly noticeable
among Tests IV, V, VI and VII. Presumably the aptitudes
of younger students do not tend to be differentiated to the same
extent as they appear to be following two or three more years of
education.

This pattern of interrelationships is not surprising when
one considers the progressive unfolding of curricular options as
a student advances from secondary school to college. In school,
the verbal-linguistic area occupies a dominant position; we
have repeatedly emphasized this aspect of college-preparatory
work (as distinguished from the more vocational and presumably
terminal high school courses). Hence there is comparatively
less differentiation among the criteria at schools which
have experimented with this battery than is later afforded in
the college-freshman year and subsequently. Moreover, the
educational aptitudes which it attempts to measure are probably
affected by training, whether formal or otherwise.

Factorial analyses have demonstrated independence, even at 
secondary schools, among measures of three broad educational 
areas (viz: Verbal-Linguistic, Scientific-Mathematical and 
Spatial-Mechanical) quite encouraging for guidance purposes. 
However, some of the earlier tests administered at these schools 
were too highly speeded, thus introducing (as discussed in the 
next chapter) a “speed factor” of general nature which doubt- 
less operated somewhat to obscure those very differences which 
the battery was designed to identify. 

Attention will now be paid to experimental results obtained 
from the Yale Battery, considered not piecemeal but as a whole 
—that is, with respect to the varying relationship of test scores 
to appropriate differential criteria of accomplishment. The dis- 
cussion which follows necessarily anticipates some of the more 
detailed analyses subsequently presented in Part II, while the 
latter in turn will duplicate, more or less directly, certain of the 
basic test-validation data shortly to be presented. 


[...] correlation data makes it [...]
within the individual frequently occur. Low scores, however,
are less reliably indicative than high scores. As every experienced
[...] a student's performance (whether
in regular classroom work, course examinations,
achievement or aptitude measures) at times may fall quite
short of his real [...]
[...]taneously, a few [...]
produce some quite misleading low [...]
[...] sabotage his test record through carelessness
or deliberation; but no one can substantially raise it
[...] competence by chance, strenuous effort
or even mere “tricks,” short of actual cheating.

It is important to establish, if possible, specific high and low
critical levels for each measure with respect to appropriately
related fields of study. These would be stated in probability
terms such as: “The chances of obtaining an honor grade in
freshman mathematics are three times as great for students
scoring 58 or higher on Test V as they are for students scoring
below 50”; or, “The chances of failing that course are four
times as great for students scoring under 40 on the Mathematical
Aptitude Test as they are for students scoring above 55,”
etc. Naturally, critical levels of this sort must be determined
locally by each institution, with reference to its own standards
and population. Several analyses in respect to Yale freshmen
were pointed toward the establishment of fairly definite critical
scores, as just cited, pertinent for individual guidance. However,
when earlier investigation had just paved the way for advance
along these lines, wartime circumstances entered both to
dislocate normal criteria and to interrupt the flow of necessary 
research. Hence the important question—what test-score levels 
(high or low) are most significant for positive or negative in- 
ferences of educational promise—can thus far be answered only 
in general terms. 
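
A statement of the form "the chances are three times as great" can be derived locally from a simple tally, as the sketch below suggests. The records are invented so that the arithmetic is easy to follow, and the helper names `chance` and `relative_chance` are illustrative assumptions; an institution would substitute its own scores, grades and cut-off points.

```python
def chance(records, in_band, outcome):
    """Proportion of students within a score band for whom `outcome` holds."""
    band = [r for r in records if in_band(r["score"])]
    if not band:
        raise ValueError("no students fall in this score band")
    return sum(1 for r in band if outcome(r)) / len(band)


def relative_chance(records, band_a, band_b, outcome):
    """How many times as likely `outcome` is in band A as in band B.

    Raises ZeroDivisionError if the outcome never occurs in band B.
    """
    return chance(records, band_a, outcome) / chance(records, band_b, outcome)


# Invented records: Test V score and whether an honor grade was earned.
records = [
    {"score": 62, "honor": True}, {"score": 60, "honor": True},
    {"score": 59, "honor": True}, {"score": 58, "honor": False},
    {"score": 54, "honor": False}, {"score": 52, "honor": True},
    {"score": 49, "honor": True}, {"score": 47, "honor": False},
    {"score": 44, "honor": False}, {"score": 41, "honor": False},
]

ratio = relative_chance(
    records,
    band_a=lambda s: s >= 58,
    band_b=lambda s: s < 50,
    outcome=lambda r: r["honor"],
)
print(f"Honor grades are {ratio:.1f} times as likely at 58 or higher as below 50")
```

With so few cases the ratio would of course be unstable; in practice each band should contain enough students for the proportions to be dependable.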

A score of 60 or higher on the profile scale previously illustrated
denotes positive aptitude which should be encouraged.
One of 65 to 70, or occasionally better, is even more important 
to note and utilize for guidance purposes. Conversely, scores 
from 45 to 40 represent danger signs; those under 40 are 
normally a “red stop-signal” with respect to any field for which 
the test in question is reasonably prognostic. 
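
The general levels just described can be set down as a simple rule of thumb. The sketch below is one possible reading: the band labels are paraphrases, and the treatment of scores between 45 and 60, for which the text draws no inference, is an assumption of this illustration.

```python
def interpret_score(score: float) -> str:
    """Rough guidance reading of a profile-scale score.

    Cut-off points follow the paragraph above; the wording of the labels
    and the neutral middle band are assumptions of this sketch.
    """
    if score >= 65:
        return "marked positive aptitude"
    if score >= 60:
        return "positive aptitude"
    if score < 40:
        return "stop signal"
    if score <= 45:
        return "danger sign"
    return "no special inference"


for s in (70, 62, 50, 43, 38):
    print(s, "->", interpret_score(s))
```

Any such rule applies only to fields for which the test in question is reasonably prognostic, and the cut-off points would need local confirmation.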

In this and subsequent chapters, evidence as to directional
validity or discriminating power of the Yale Aptitude Testing
Program, like that respecting many other differential measures,
will be presented chiefly in the form of correlation coefficients
obtained between separate test scores and differential criteria.
However, mere correlations (even if accompanied by details as
to their probable or standard errors, level and range of ability
represented or additional factors bearing upon the interpretation
of results) afford but a partial means of estimating true
validity among various prognostic tests. Limitations imposed
upon this method of analysis, by reason of low criterion reliabilities
and coarse grade intervals, were mentioned in the preceding
chapter. As certain data there given foretell, correlations between
Yale Aptitude Battery scores and later scholastic achievement
have consistently been reduced by the same attenuating influences
which restricted coefficients in the Minnesota experiment.
Complete data accumulated at Yale throughout some
years are too extensive for complete presentation here; only the
more recent findings will be summarized.


COMPOSITION OF RECENT FRESHMAN BATTERIES 


Before any comprehensive tables of correlation among vari- 
ous aptitude test scores and grades in differential courses are 
presented, some general remarks are necessary: As noted earlier, 
the first and basic element (designated as I in the foregoing 
profiles and in subsequent tables of the Freshman Battery) for 
all classes has been the College Entrance Examination Board’s 
Scholastic Aptitude Test (Verbal Section). According to evi- 
dence subsequently offered, this measure has both high immedi- 
ate reliability and unusual long-range stability. Other prognos- 
tic materials employed in the Freshman Battery have varied 
considerably in nature throughout recent years. 

Validation data in Table 9, with respect to aptitude test 
scores at matriculation and subsequent grades in various first-year
courses, represent the Classes of 1944 and 1945, which
were the latest to be relatively unaffected by withdrawals for 
selective service. The Mathematical Aptitude Test administered 
to the Class of 1944 was assembled by the College Board from 
its extensive file of pretested items, plus certain new (experi- 
mental) materials. For the entering Class of 1945, Test V of the 
Yale Battery (analogous in nature but differing in content and 
not so well standardized upon a large population) was em- 
ployed. 

After these preliminaries, we proceed to consideration of the 
evidence, as presented in Table 9. Its form, and certain of the
data contained therein, will reappear in subsequent chapters on
aptitudes in Part II. Further details as to composition of the
groups represented, means and standard deviations will then be 
given. Appendix B summarizes these data, with reference to 
the correlation tables which follow. 

In Table 9, certain sections have been printed in bold face 
type to indicate particularly appropriate relationships. For 
instance, the College Board Scholastic Aptitude Test correlated 
better with an average of grades in English and history than
with obviously unrelated subjects, such as mathematics and 
engineering drawing. Similarly the Artificial Language Test 
showed its highest validity with respect to beginning Spanish. 
Physics was best predicted from Tests III and V, while work in
mathematics and engineering drawing was most closely related
to test scores on the Quantitative Reasoning, Mathematical
Aptitude and Spatial Visualizing tests. Accompanying low
correspondence between certain aptitude tests and inappropriate
criteria (for example, between Mechanical Ingenuity Test
scores and average of grades in English and history) is no less
interesting in a differential sense and evidences the generally
discriminating nature of this battery.


TABLE 9


Summary of Certain Validity Correlations for Aptitude Tests
Administered to Yale Freshman Classes of 1944 and 1945


                                          Aptitude Tests

                                 I       II        III      IV       V      VI      VII
Freshman                         SAT     Artifi-   Verbal   Quant.   Math.  Spat.   Mech.
First-Term                       Verbal  cial      Reason.  Reason.  Apt.   Visual  Ing.
Course            Class   No.            Lang.

Average of        '44     290    .49     .34       [?]      .23      .16    [?]     [?]
English and       '45     286    .44     [?]       .50      [?]      .18    .11     [?]
History Grades

Spanish           '44     62     .46     .57       .40      .42      .41    -.07    .24
                  '45     145    .18     .44       .47      .15      .14    .19     -.07

Physics           '44     55     .40     [?]       [?]      [?]      [?]    [?]     [?]
                  '45     62     .31     .43       .45      .40      .51    [?]     .40

Average of        '44     246    .16     [?]       [?]      .49      .42    .46     [?]
Mathematics and   '45     [?]    .25     .32       [?]      .43      .46    .42     .44
Drawing Grades

Engineering       '44     202    .11     .07       [?]      [?]      [?]    .55     [?]
Drawing           '45     [?]    [?]     [?]       [?]      [?]      [?]    .56     [?]


Review of Attenuating Factors 


Several elements, all tending to restrict coefficients among individual
scores and subsequent performance in freshman
courses, have been mentioned. These attenuating factors, of
course, play their unwelcome role in all prognostic-testing situ- 
ations. They are not being stressed here merely to provide an ex- 
cuse for what at first glance may appear as rather low correla- 
tions reported in Table 9 between differential parts of the Yale 
Battery and their respectively appropriate validation criteria. 
In fact, the foregoing data and present discussion thereof are
chiefly intended to suggest why and how various aptitude tests
all operate “under wraps.” The most important external limita- 
tions imposed upon them at Yale during recent years have been: 

(1) Self-selection among student groups electing particular 
courses, with consequent restriction in range of their scores on 
certain tests. 

(2) A marking system which (with few exceptions) em- 
ployed only five intervals. 

(3) Low criterion reliabilities, as estimated from the correlations
(.58 to .85) between first- and second-term grades in all
freshman courses.⁷

This combination of limiting circumstances may be regarded
as a series of negative components, roughly analogous to such
factors as weight, friction and observation errors complicating
the effect of component forces in physics. The corresponding
“drag” upon positive elements, and uncertainties of calibrating
a scale of measurement, are obviously far greater in a mental
than in a physical testing situation. Even general scholastic
predictions, however refined, seldom correlate above .70 with
freshman averages because reliability of the latter themselves is
scarcely higher.
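
The way in which unreliable criteria hold coefficients down can be made concrete with the classical correction for attenuation. This formula is standard in test theory but is not quoted in the present text, so the sketch below should be read as an illustration only. The criterion reliabilities (.58 and .85) and the observed coefficient (.40) are figures mentioned in this chapter; the test reliability of .90 is an assumed value.

```python
from math import sqrt


def validity_ceiling(test_reliability, criterion_reliability):
    """Largest correlation obtainable between two imperfectly reliable
    measures, under classical test theory."""
    return sqrt(test_reliability * criterion_reliability)


def corrected_for_attenuation(observed_r, test_reliability, criterion_reliability):
    """Estimated correlation had both measures been perfectly reliable."""
    return observed_r / validity_ceiling(test_reliability, criterion_reliability)


TEST_RELIABILITY = 0.90  # assumed for illustration; not given in the text
OBSERVED_R = 0.40        # a coefficient the text regards as encouraging

for criterion_reliability in (0.58, 0.85):
    ceiling = validity_ceiling(TEST_RELIABILITY, criterion_reliability)
    corrected = corrected_for_attenuation(
        OBSERVED_R, TEST_RELIABILITY, criterion_reliability
    )
    print(
        f"criterion reliability {criterion_reliability:.2f}: "
        f"ceiling {ceiling:.2f}, corrected coefficient {corrected:.2f}"
    )
```

The correction says nothing about restriction of range or coarse grade intervals, which lower observed coefficients further and would have to be treated separately.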

General intelligence or scholastic aptitude measures taken 
alone, as indicated by Thurstone’s data and other evidence in 
Chapter III, infrequently correlate better than .50 with aca- 
demic-year averages, whose presumptive reliability exceeds that 
for each separate course. In the attempt to validate individual 
aptitude tests against individual and specialized criteria, it 
therefore seems reasonable to regard positive, uncorrected coefficients
of .40 or more as encouraging and of .60 as perhaps
maximal. This statement is at variance with more rigorous
strictures elsewhere set forth regarding the interpretation of
such values (cf. Chapter II, p. 52). However, the circumstances
at hand, as just described, are rather unusual in their
combined attenuating effect. The claim just made of decided
[...] for coefficients of .40 to .60 must be judged in the
light of these special circumstances and the immediate guidance
problem confronting many counselors.

[...] magnitude; i.e., correlations with appropriate versus
inappropriate criteria. For example, a verbal factor test might
yield a .60 coefficient with English or history grades. Well and
good; more can hardly be expected of it. But if this same test
also correlates .50 with primarily nonverbal courses like mathematics
or physics, its discriminating power is practically nil.
It is then clearly operative as a general rather than a differential
measure, doing a pretty good prognostic job with respect
to all-round promise but affording in this situation only faint
clues useful for differential prognosis. Another test (e.g., of
mathematical aptitude) might correlate only .40 with its respectively
most appropriate criteria (classroom performance in
mathematics or scores on some objective achievement test covering
the same ground). Nevertheless, if that same test consistently
shows very low relationship to analogous performance in
English and history, it is more discriminating than the former
and therefore more valuable in differential prediction.

7. Cf. Chapter IV, p. 183.
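
The comparison of appropriate with inappropriate criteria can be reduced to simple arithmetic, as sketched below. The coefficients of .60, .50 and .40 are the hypothetical figures used in the discussion above; the value of .10 standing for a "very low relationship" is an assumption, as is the use of a plain difference between coefficients as the index of discrimination.

```python
def differential_margin(r_appropriate, r_inappropriate):
    """Excess of a test's correlation with its appropriate criterion over
    its correlation with an inappropriate one (a plain difference, chosen
    here for simplicity)."""
    return round(r_appropriate - r_inappropriate, 2)


# Verbal test: .60 with English or history, .50 with mathematics or physics.
verbal_margin = differential_margin(0.60, 0.50)

# Mathematical test: .40 with mathematics; .10 with English and history
# is an assumed stand-in for "very low relationship".
mathematical_margin = differential_margin(0.40, 0.10)

print("verbal test margin:", verbal_margin)
print("mathematical test margin:", mathematical_margin)
```

On this reckoning the mathematical test, though its highest coefficient is the smaller of the two, separates the fields more sharply than the verbal test does.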


TRIAL OF ANALOGOUS BATTERY WITH NAVY V-12 
FRESHMEN 


The Navy V-12 program beginning for freshmen in July,
1943, has provided a situation in some ways unique for aptitude
test development. Six tests of differential scholastic promise
were administered for several terms to each incoming group.
Three of these tests were drawn from the Yale Battery and
three from materials (analogous to other sections of this battery)
specifically developed by the College Entrance Examination
Board.⁹ The several parts are described as follows:


I. SAT: Verbal Factor (College Entrance Examination
Board Scholastic Aptitude Test, Verbal Section).
III. VRT: Verbal Reasoning (Test III of Yale Battery).
IV. QRT: Quantitative Reasoning (Test IV of Yale Battery).
V. MAT: Mathematical Aptitude (CEEB Scholastic
Aptitude Test, Mathematical Section).
VI. SVT: Spatial Visualizing (CEEB Spatial Relations
Test).
VII. MIT: Mechanical Ingenuity (Test VII of Yale Battery).


8. Cf. Crawford and Burnham (1945). 

9. College Entrance Examination Board tests (Verbal, Mathematical and 
Spatial) given to Navy freshmen at Yale at the opening of the term in July, 
1943, November, 1943, and March, 1944, were designated by the CEEB as NY43. 
Their present counterpart is known as 5У-1.

The Freshman Year Program for V-12 students did not
entail any work in foreign languages; hence omission of the 
Artificial Language Test from the Navy Series. Otherwise the 
latter is essentially the same as the Yale Battery. Attenuating 
influences seem to have been mostly under control in the V-12 
program for the following reasons: 

1. All basic V-12 freshmen (with few exceptions, such as 
premedical trainees) pursued a uniform curriculum. Thus 
there could be no variation in range of test scores among elective
subgroups. This presented a unique advantage for evaluating
the effectiveness of prognostic tests.¹⁰

2. Disparate standards of grading were in force throughout 
various institutions administering this same V-12 curriculum, 
and even among different departments within a single institution.¹¹
However, the Navy’s own screening Achievement Tests
yielded objective measures of performance standardized on a
nation-wide, though selected, population.

3. Motivation generally was high among V-12 students, be- 
cause those who made a good record could definitely look for- 
ward to Midshipman School, leading in turn to a commission. 
There have been individual exceptions (notably among men
detached from active units in order to fill Fleet and Marine
Corps college quotas from the enlisted ranks) who preferred to
get back with their previous “buddies.” Yet the great majority
did their best to qualify, in the classroom and otherwise, as of- 
ficer candidates. 

Table 10, which follows, has been prepared to show correlation
of Yale Aptitude Test scores with grades in Freshman
Year courses taken by V-12 students. In general it will be seen
that correlations reported for Navy students were somewhat
higher than was the case with earlier civilian students represented
in Table 9. Another point of difference is that, whereas
less than 10% of the civilian freshmen normally elected beginning
physics, this subject was required of all Navy freshmen.
In the Navy data, all correlations were based on populations
of substantial size.

10. The writers hasten to add that, whether or not such curricular uniformity
[...] service training situation, they cannot regard it as desirable
[...]. It provided, [...]

11. [...] have no direct knowledge of, or access to, complete records in
this respect. Scores made by Yale V-12 candidates on the several Navy
Achievement Tests of 1943, 1944 and 1945 reflect considerable variation
throughout different areas. Since the Yale average has consistently ranked
within the Navy’s top third on these tests, it is clear that institutional variations
must also be extensive.


TABLE 10


Aptitude Test Scores Correlated with Faculty Grades in V-12
Freshman Courses


(July, November, 1943 and March, 1944, Entrants Compared)


                                          Aptitude Tests

                          No. of     I      III    IV     V      VI     VII
Term I Courses            Students   SAT    VRT    QRT    MAT    SVT    MIT*

General Averages
  July Entrants           360        .44    .50    .56    .61    [?]    [?]
  November Entrants       [?]        [?]    .45    .53    .55    [?]    [?]
  March Entrants          269        .45    .52    .58    .63    .44    .48
English
  July Entrants           [?]        .48    .47    .27    [?]    .09    .01
  November Entrants       168        [?]    [?]    [?]    [?]    [?]    [?]
  March Entrants          [?]        [?]    [?]    [?]    [?]    .19    .16
History
  July Entrants           [?]        .51    .46    .35    .40    .04    .19
  November Entrants       168        .42    [?]    [?]    .48    [?]    [?]
  March Entrants          [?]        .49    .47    [?]    .42    .08    .14
Physics
  July Entrants           [?]        [?]    .41    .49    .59    [?]    [?]
  November Entrants       178        .25    .59    .48    .51    .25    [?]
  March Entrants          [?]        .52    .46    .40    .44    [?]    .31
Mathematics†
  July Entrants           [?]        [?]    [?]    .48    [?]    [?]    .18
  November Entrants       [?]        [?]    [?]    [?]    [?]    [?]    [?]
  March Entrants          263        [?]    .44    .55    .62    [?]    [?]
Engineering Drawing
  July Entrants           [?]        .18    [?]    .41    .56    [?]    .50
  November Entrants       168        .09    .22    .35    .46    .58    .45
  March Entrants          246        .11    .17    [?]    [?]    [?]    [?]


* July 1943 entrants who had previously studied Physics were required to take a
Physics Achievement Test instead of the Yale Mechanical Ingenuity Test. The number
of students represented in this table as taking the Mechanical Ingenuity Test was
thereby reduced from 360 to 74. The Physics Achievement Test proved to be of limited
value in this situation and was subsequently discontinued.

† Mathematics correlations were based on the combined groups in Mathematics I
and III. Coefficients were originally calculated for each of these groups, but the differences
between them were not sufficiently large or consistent to warrant their being
separately reproduced here.

Certain sections in Table 10 have again been set in bold face 
to indicate particularly appropriate relationships. For instance, 
the College Board SAT correlated much better with grades in 
English and history than with less obviously related subjects 
—physics, mathematics and engineering drawing. Similarly, 
the Yale Quantitative Reasoning and the College Board MAT 
agreed better with performance in physics and mathematics 
than did other tests. College Board Spatial Relations and Yale 
Mechanical Ingenuity Test correlations with freshman grades 
in engineering drawing were encouragingly high. Accompany- 
ing low correspondence between certain aptitude tests and 
inappropriate criteria (for example, between College Board 
SAT and mathematics or engineering drawing grades) are once 
more noteworthy. Aptitude tests which correlated best with 
general average of all subjects in the V-12 freshman curriculum 
were: Verbal Reasoning, Quantitative Reasoning and College 
Board MAT. This finding partially reflects the extent to which 
physics and mathematics contributed heavily in terms of class 
“contact hours” to the general average for each Navy student. 
The general pattern among these relationships has tended to 
persist from group to group. 

Conspicuously low relationship between the Spatial Visualizing
and Mechanical Ingenuity Tests and all courses other
than engineering drawing (with which they correlated well) is
quite in accord with expectations. Previous trial of many such
three-dimensional visualizing and “mechanical insight” measures
at Yale among successive civilian entrants has consistently
indicated their unique value in this sense. No other educational
[...] seems, at college-preparatory or freshman levels, better
[...] than the spatial factor (in which we include
[...] aspects represented by Test VII of the Yale Battery).
[...], though of restricted nature, is nevertheless
[...] within this particular area; well-constructed
spatial relations and mechanical ingenuity tests offer the best
[...] for engineering (mechanical) drawing and
[...] engineering requi[...] curriculum.
Correlations with Navy Achievement Tests


Following their first term, certain achievement tests were 
administered to the V-12 group. These tests were of the objec- 
tive type and covered the following fields: English, physics, 
mathematics, and history (or chemistry for “pre-meds”). The 
College Entrance Examination Board, which had been com- 
missioned by the Navy to prepare these achievement tests for 
use on a nation-wide basis, reported separate and total scores 
in terms of percentile standing among all Navy V-12 students 
throughout the country. 

Before correlating reported percentile ranks with any other 
scores, it was necessary to “normalize” the percentile distribu- 
tions. At a later date it became possible to secure from the 
College Board raw scores for the July, 1943, entrants. Corre- 
lations based on these raw scores were found to be almost iden- 
tical with those derived from normalized percentile scores. 
Achievement test data reported herein (Table 11) were derived
from normalized percentiles. The latter merely represent
conversion, by means of the normal probability curve illustrated
on p. 34, Chapter II, of reported percentile scores to
equivalent Standard Deviation (Sigma) values, from which the
correlations were then directly computed.
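
The conversion just described can be sketched in a few lines of Python. This is an illustration only, assuming the standard-library `NormalDist` class as a stand-in for the printed normal probability curve; the function name and the sample percentile ranks are hypothetical and are not taken from the Navy data.

```python
from statistics import NormalDist


def percentile_to_sigma(percentile: float) -> float:
    """Convert a percentile rank (strictly between 0 and 100) to the
    equivalent standard-deviation (sigma) value on the normal curve."""
    if not 0.0 < percentile < 100.0:
        raise ValueError("percentile must lie strictly between 0 and 100")
    return NormalDist().inv_cdf(percentile / 100.0)


# Hypothetical percentile ranks for five students (illustrative only).
percentiles = [5, 16, 50, 84, 95]
sigmas = [percentile_to_sigma(p) for p in percentiles]

for p, z in zip(percentiles, sigmas):
    print(f"percentile {p:>2} -> sigma {z:+.2f}")
```

Equal steps in percentile rank are unequal steps in sigma units (the step from 84 to 95 is wider than that from 50 to 61), which is why the conversion precedes any product-moment correlation.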

Table 11 shows first term data for July, November and 
March entrants respectively. The correlations are between 
Yale Aptitude and Navy Achievement Test scores. Coefficients 
for appropriate criteria were without exception considerably 
higher than corresponding values in Table 10. Whatever may 
be the reasons for this, it is likely that increased reliability of 
the achievement tests and sheer test-taking ability (discussed 
later in this volume) were important factors. In practical terms, 
this simply means that aptitude tests did a superior job of forecasting
probable scholastic achievement when the latter was
reliably measured. But there are other important values which
these objective tests may not have appraised; those are largely
personal and their measurement subjective in nature. For example,
the Navy Achievement Tests obviously did not provide
for measurement of such things as clarity of expression, originality
and logical organization of material. Cooperativeness
and consistency of performance in class were probably not
measured, nor was persistence in overcoming obstacles. These
factors and many others, in addition to having the right an- 
TABLE 11


Aptitude Test Scores Correlated with Achievement Test Scores 
in V-12 Freshman Courses 


(July, November, 1943 and March, 1944, Entrants Compared)* 


Aptitude Tests 


Navy Achievement Tests E III IV Р VI VII 
taken at end of Term I No. of Р 
(normalized %ile basis) Students SAT VRT QRT MAT SVT MIT


Total Score 


July Entrants 335 47 66 66 71 97 259 
November Entrants 155 63 .68 62 .68  .23 2.52 
March Entrants 935 .69 68 62 .70 83 21 
English 

July Entrants 335 .79 469 46 45 Ла 41 
November Entrants 155 .75 65 47 46  .09 33 
March Entrants 995 .83 (69 50 44 ла 81 
History 

July Entrants 399 60 .1 .58 48 04 51 
November Entrants 144 61 .51 от 99  .07 25 
March Entrants 935 45 41 26 за 16 25 
Physics 

July Entrants 985 .4 45  .53 .8 .51 .52 
November Entrants 155  .97 48 48 56 2.48 «67 
March Entrants 235 39 48 46 61 39 .59 
Mathematics 

July Entrants 80 57 ат 65 ла 38 3 
November Entrants 182 .16 41 .54 (468 16 41 
March Entrants 285 44 42 61 ЛА 39 AT 


* See footnotes on Table 10. 


swer, may have entered quite properly into the instructor’s 
grading process, but would have been less directly represented 
in achievement test scores. 

The same general pattern of appropriate relationships be- 
tween aptitude and achievement tests holds in Table 11 as was 
the case with courses of study in Table 10. The College Board 
SAT and the Yale Verbal Reasoning Test correlated highly 
with English and well with history. In fact, their correspond- 
ence with English Achievement Test scores ranged between .65 
and .83. High coefficients were also found when the Yale Quan- 
titative Reasoning Test and the College Board MAT were 
compared with Navy Achievement Test scores in mathematics,
and to a lesser but still impressive degree with Physics Achieve- 
ment. The Yale Mechanical Ingenuity Test also correlated well 
with the latter criterion. Unfortunately, the Navy Achieve- 
ment series did not include a test of mechanical drawing and 
descriptive geometry; hence no suitable objective index for 
validation of the spatial test was provided. Navy data were 
included here for two reasons: first because of their intrinsic 
value for differential aptitude measurement at the college level; 
second because they illustrate results obtained when a test bat- 
tery operates under more uniform circumstances than prevail 
with civilian students. 

Since the first printing of the present volume a new series of 
differential aptitude tests has been published by the Psycho- 
logical Corporation." Designed for purposes of educational 
and vocational guidance in grades 8 through 12 this battery 
consists of eight measures bearing the following titles: Verbal 
Reasoning; Numerical Ability; Abstract Reasoning; Space 
Relations; Mechanical Reasoning; Clerical Speed and Accuracy;
Language Usage, Part I—Spelling, Part II—Grammar.
Scores are reported in profile form. While time has not per- 
mitted the authors to gather and publish extensive validation 
data, they lay stress on the importance of such data. 

The Yale Battery, having been developed to meet the needs
of a particular situation under definite restrictions on time 
available, cannot adequately represent numerous other aptitude 
testing possibilities with perhaps broader scope. Therefore no 
attempt has thus far been made toward portrayal either of 
sample items or of specific techniques by which relative promise 
for disparate fields of study may otherwise be measured. These 
will, however, be discussed throughout later chapters, which
successively deal with differential prognosis in various directions.
Illustrations of pertinent material, together with selected
references to scientific and educational literature bearing
thereon, will be presented in Part II. First, however, we shall
consider another important approach to the goal of measuring
individual differential promise—that represented by research
on “unitary traits” and particularly on “primary mental abilities.”


Bennett, George K., Seashore, Harold G. and Wesman, Alexander G.
[...] Manual of Differential Aptitude Tests. New York, The Psychological
Corporation.


CHAPTER VI 


UNITARY TRAITS AND PRIMARY 
ABILITIES 


THE foregoing chapters have severally discussed tests of
general intelligence, differential achievement and educational
aptitude in that order. Before considering developments
of the latter type in relation to specific fields of
study—as will be undertaken in Parts II and III—some attention
should also be given to the problem of appraising basic
mental traits.

The broad topic which they represent has been studied and
fought over by philosophers and psychologists for many centuries;
it long antedates the emergence of psychology as an
independent science. By comparison, the relatively new techniques
of individual measurement and guidance are scarcely out of
swaddling clothes. Obviously, no thorough consideration of
these fundamental problems can be attempted in a volume of
limited and particular scope. The intent of sketching progress
in achievement test procedures on the one hand, and of attempting
to evaluate recent aspects in the long search for underlying
mental factors on the other hand, is simply to place
aptitude testing (as arbitrarily defined for the purposes of this
volume) in its proper, intermediate setting. Brief mention of
the currently most important factor theories, and subsequent
detailed consideration of one (as represented by Thurstone’s
Primary Mental Abilities research), seem advisable for the
same reason.


Spearman—A Pioneer in Differential Measurement 

More than forty years ago (1904) the late Charles Spearman 
advanced his now famous theory of one general and various 
specific factors, which was to prove of momentous importance 
in the history of mental testing. This first challenged the naive
early concept that mental ability represented a sort of biological
(unifactor) entity subsumed under the category “intelligence.”
Spearman’s proposal was, roughly, that intelligence
consisted of a permeating general factor coupled with various
specifics (i.e., g plus s1, s2, s3 ... etc.) as selectively applied
to different tasks. From this evolved the original “bifactor”
theory. Later that was extended to recognize the existence of
“group factors”—such as verbal, numerical or mechanical—
less general than g but more so than each s.¹ Hence the Spearman
theory now is really “trifactor” in nature, postulating g
as always present throughout any battery of mental tests rather 
than as a universal attribute of all mankind under all condi- 
tions. Along with it certain less ubiquitous “group,” plus addi- 
tional more specialized s, factors are regarded as jointly com- 
prising the patterns of human intelligence. Presumably each 
of these has its quantitative ceiling for any individual compared 
to his fellows and also (within the individual himself) a qualita- 
tive ceiling in relation to the variable demands made by differ- 
ent tasks. Thus a person possessing certain quanta of general, 
group and specific abilities will apply them to greater or less 
extent according to respective situation requirements. It would 
be rather difficult, for example, to utilize even marked talent 
for numerical (group) or spatial visualizing (specific) factors 
advantageously in depicting the psychological quirks of Lady 
Macbeth; nor, by contrast, is high verbal ability required in 
accurate manipulation of a slide rule. The general intellective 
factor nevertheless is assumed to operate in some degree for all 
mental situations, ranging from quite simple functions to ab- 
struse scholarly research. “Non-intellective general factors” 
as mentioned in the footnote below, supposedly have this same
ubiquitous character; they represent one’s tendency to apply 
his abilities through stress and strain. 


Emphasis upon Group Factors 


Godfrey Thomson (1939) maintains, on the evidence of
sampling experiments, that group factors are predominant in
mental organization and of greater importance (especially for
individual guidance purposes) than either Spearman’s g or s
concepts. To the novitiate (and indeed to all but experts in
factorial dialectic—which certainly excludes the writer) this
theorem would seem not much at variance with Spearman’s trifactor
hypothesis; just a gentlemanly little matter of differing
emphasis upon the middle-range group factors! Thomson’s attitude
toward various other mental-ability theorists is eminently
fair and open-minded. For example, he does not deprecate
or necessarily contravene either Spearman’s trifactor or
Thurstone’s independent, multiple-factor concept. Such catholic
regard for diverse views considerably enhances the force of
his opinions.


1. This theory now also postulates the existence of what may be called
“non-intellective general factors”—perseveration, will, oscillation, etc.—which
represent aspects of character or “drive” rather than mental ability per se
(Spearman, 1927).

By the simple expedients of throwing dice and drawing cards, 
Thomson has demonstrated that several factor patterns result 
quite naturally from chance sampling of a number of posited, 
independent abilities. Any given test may sample a number of 
these abilities. Any two tests will be correlated, depending upon 
the extent to which there is commonality among the underlying 
elements represented in each. Thomson does not deny that there 
may be general or specific factors, but regards them as special
cases of group factors. Guilford (1936, pp. 466-471) points
out that the sampling theory explains many known findings of
test experience, such as the principle that parts of a battery
should correlate low with one another and high with the criterion
in order to obtain maximum validity. This sampling
theory makes educational selection and vocational guidance
logically possible. Early proponents of this view include Hull
(1928) and Kelley (1928).
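
The arithmetic behind the sampling view can be sketched briefly. The snippet below is an illustration under simplifying assumptions that are ours, not Thomson's stated procedure: every elemental ability is independent and of equal variance, and a test score is simply the sum of the elements the test happens to sample. Under those assumptions the correlation between two tests equals the number of shared elements divided by the geometric mean of the two sample sizes. The test names and element numbers are hypothetical.

```python
from math import sqrt


def sampling_correlation(test_a: set, test_b: set) -> float:
    """Correlation between two tests, each scored as the sum of the
    independent, equal-variance elements it samples."""
    if not test_a or not test_b:
        raise ValueError("each test must sample at least one element")
    shared = len(test_a & test_b)
    return shared / sqrt(len(test_a) * len(test_b))


# Hypothetical tests, each drawing on a few numbered elemental abilities.
verbal = {1, 2, 3, 4}
reading = {3, 4, 5, 6}
drawing = {7, 8, 9}

print(sampling_correlation(verbal, reading))  # two shared elements of four each
print(sampling_correlation(verbal, drawing))  # no shared elements
```

In this scheme an element sampled by every test in a battery behaves like a general factor, one sampled by a single test behaves like a specific, and everything in between is a group factor, which is the sense in which the first two are special cases of the third.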

Some weighted group factor pattern, which regards even
broad and complex abilities as variously applicable to different
mental tasks, has current theoretical support from many non-partisan
critics. No practical demonstration thereof, in terms
of a test battery so conceived and administered for actual measurement
and guidance purposes, has yet been made in this country,
if anywhere. Thomson’s approach, though speculative and
philosophical, is also stimulating. It seems more objectively
critical than that of other factorists concerned with vehement
support of particular methods, but it still lacks pragmatic
confirmation.


Multiple Factors 

Another system for describing and analyzing various components
of human intelligence is the multiple-factor theory.
Its major proponent is L. L. Thurstone, whose extensive research
along these lines will later be reviewed in some detail.
This theory, as originally expounded, postulates virtually complete
independence of basic or primary mental abilities without
recognition of either general or group factors. Like the others,
it has been modified or “conditioned” since birth: despite
spirited attacks and reprisals between Spearman and Thurstone,
these former antagonists have more recently been sitting
dos-à-dos on the same fence. Since the multiple-independent-factor
concept, as represented by Thurstone’s Primary Mental
Abilities Battery, will receive major attention in this chapter,
it requires no immediate further exposition now.

The reader unfamiliar with these basic theories, but desirous 
of pursuing them further than is possible in such a cursory 
discussion, will find an excellent historical summary and ex- 
position in Guilford’s (1936) chapter on “Factor Analysis,” 
in Thurstone’s (1940) somewhat more technical article entitled 
“Current Issues in Factor Analysis," and in Wolfle's (1940) 
monograph, “Factor Analysis to 1940.” Cattell's penetrating 
review, “The Measurement of Adult Intelligence”—which to 
an unusual degree for studies of that type agreeably combines 
scholarship with readability and wit—contains the following 
statements: 

“A persistent cause of misunderstanding is the continued
statement by some psychologists of the Spearman two-factor
theory in the form it reached a decade ago rather than in the
more developed form admitting group factors (i.e., a three-factor
theory) as it appears in the later work of Spearman and
Holzinger. ...

“Realization of the common destiny of the Spearman and
Thurstone approaches has perhaps been obscured by a certain
intransigence in both parties to the discussion.” (Cattell, 1943,
pp. 168-169.)

Space unfortunately does not permit more extensive quotation
here from this article or additional reference to Godfrey
Thomson's important work. Even the points made above are
but meagerly presented, out of their full context.


UNITARY TRAITS 

Proceeding from theory to practice, we have first to consider
elaborate researches carried on for more than a decade under
general supervision of the Unitary Traits Committee and
known as the “Spearman-Holzinger Unitary Traits Study.”
Holzinger (1936b), as immediate director of these investigations,
states that the committee was formed in 1931, largely as
the result of two major publications both already noted—The
Abilities of Man by Spearman (1927) and Crossroads in the
Mind of Man by Kelley (1928). To quote Holzinger: “The
Problems and Plans Committee of the American Council on
Education empowered Professor Thorndike to act as chairman
of this committee and secured a grant of money from the Carnegie
Corporation for the purpose of preparing a plan to study
unitary differential traits. The early members of this committee
included Professors E. L. Thorndike, Charles Spearman, T. L.
Kelley, Clark Hull, Karl Lashley and Karl J. Holzinger. At
later meetings Professors T. V. Moore, Henry Garrett, and
Harold Hotelling were added to the Committee.” (Op. cit.,
p. 335.)

Major implications for the whole science of mental measure- 
ment are to be expected from this undertaking. A long-range 
program based on extensive testing of children at Mooseheart, 
Illinois, and at the Thorp Elementary School in Chicago has 
utilized Spearman's theory in a systematic pursuit of unitary
[...] traits. Their presumptive nature has not yet been made
clear.

To quote Holzinger again: “Ideally what Professor Thorndike
sought were the unitary traits, say 1 to 20, which would
inventory all the primary factors in man's nature, and enable
one to calculate all the derived traits.” Further: “The large
sets of variables have been reduced to simple factor patterns
involving a general factor and seven or eight relatively small
group factors.” (Idem, 1936b, pp. 336, 343.) As will later be
evident from consideration of Thurstone’s Primary Mental 
Abilities Battery, these two originally (and at times bitterly) 
opposed schools of factorial thought at last seem to have con- 
verged toward essentially similar desiderata. 

It seems rather strange that few references to this important
Unitary Traits project have as yet appeared in psychological
or educational literature. Occasional progress bulletins (chiefly
technical in nature and reporting masses of statistical data
with little interpretation thereof) have been privately issued
in a series of Preliminary Reports (Holzinger, 1936a). Since
[...] no clear-cut results, it would be impracticable—if
[...] discuss that investigation [...]
its evident potential value. Tentative factorial
[...], as set forth in the Preface to No. 4 of this series,
are however noted on page 176 of the present chapter.


FUNCTIONAL TESTING 


Certain instruments developed by Tyler and associates for 
the Evaluation Study of the Eight Year Experiment (under- 
taken by the Progressive Education Association) should also 
be mentioned. Several optimistic accounts of the study as a
whole have been published (Tyler, 1942). This experiment offers
another promising approach to the measurement of thought
processes, rather than of acquired knowledge. It suggests new
ways of appraising such traits as induction, perception, “the 
nature of proof” (Fawcett, 1936), logical reasoning, judg- 
ment, analysis of data, quantitative thinking and other factors 
of educational significance, though expressed in other than cur- 
ricular terms. This may be called a functional approach to 
problems of measurement and guidance, in some ways akin to 
Lindquist’s functional organization of achievement tests, dis- 
cussed in Chapter IV. It is unfortunate that objective data in 
its behalf are as yet insufficient for presentation. 


MULTIPLE-FACTOR THEORY 


Since the scope of this chapter is necessarily limited, further 
discussion herein of unitary or basic intellectual traits will be 
confined to the Thurstones’ extensive research on Primary Men- 
tal Abilities—partially because of the wide attention it has re- 
ceived during recent years. Whatever opinions exist as to the 
practical result thus far obtained from that monumental proj- 
ect, there can be no question as to its far-reaching values for 
psychology and education. 

A sort of primer on the subject (introductory to subsequent 
technical reports) was the address delivered by Professor 
Thurstone (1936) at the Fifth Educational Conference in New 
York City and later published in the Educational Record. Data
as to actual outcome of the experiments then in progress or
later planned have since been reported in two important monographs
(Thurstone, 1938b and 1941a). The latter and subsequent
publications naturally afford a major basis for comments
and criticisms which follow. Anyone seeking more information
on methods of factorial analysis and its particular
application to this experiment than superficial treatment here 
can offer should consult these original sources and the other 
references cited. The monographs are so frequently quoted in 
the present chapter that, for simplification of reference, they 
will on occasion be designated merely as Monograph No. 1 or
Monograph No. 2.

According to evidence presented in these monographs and
in other studies, Thurstone had isolated certain mental traits
or intellectual abilities as primary and psychologically independent
factors. There seems to be some doubt whether this
theoretical isolation is wholly realistic. Recently, Thurstone has
stated: “The primary abilities themselves turn out to be correlated
just as the original tests were correlated.” (1945, p. 7.)
It is not clear whether he still regards the mental factors as
independent, although not their expressed, mensurable products,
the abilities. This point will be discussed subsequently, after
progressive development and modification of Thurstonian research
and theory have been considered. However, even granting
independence of the factors as such, their practical value
for guidance purposes (at least in terms of demonstrable measurement)
is of even more doubtful significance. There has been
considerable misunderstanding of Thurstone’s own claims as
to the validity (a) of the primary mental abilities themselves,
and (b) of the tests as yet developed for their appraisal. Moreover,
in the course of several years’ investigation, even the
sacrosanct primaries have not altogether “stayed hitched.”

In his first monograph just cited, Thurstone presents elabo- 
rate statistical evidence to support clear isolation of seven 


primary factors, plus two others (R and D) more speculatively 
established, viz.: 


Primary Factor              Symbol

Spatial Visualization       S
Perception                  P
Number Facility             N
Verbal Comprehension        V
Word Fluency                W
Memory                      M
Induction                   I

Restricted Reasoning        R  } “Tentative”
Deduction                   D  }

It is interesting to note that the Spearman-Holzinger Unitary
Trait Study likewise isolates nine “assumed factors.” As
listed in Report No. 4 (Holzinger, 1936a, p. 1), these are:


Assumed Factor              Symbol

Spearman’s g                [...]
Verbality                   [...]
Mental Speed                [...]
Motor Speed                 [...]
Oscillation                 [...]
Mechanical Ability          [...]
Attention                   [...]
Mathematical Ability        [...]
Imagination                 [...]

This comparison suggests the extent to which leading investigators
vary in the nature of elements derived from their
respective data and methods of analyses. Wide differences between
these two patterns are evident. To what extent such variations
are important or rather superficial (i.e., reflecting merely
disparate labels for essentially parallel types of ability) thus
far seems indeterminate. As will subsequently appear, Thurs- 
tone’s second monograph dealing with other populations offers 
new evidence and conclusions (even the presence of a general 
factor) somewhat in contrast with his earlier findings. 

It must be recognized that neither Thurstone’s nor Holzing- 
er’s test batteries nor their respective experimental groups 
were the same. However, differences in what they found prob- 
ably reflect to some extent differences in what they were look- 
ing for—e.g., independent and multiple, versus general and 
unitary, traits. These were sought not only through disparate 
methods but also through variant samplings, as represented by 
the basic measures and populations underlying Thurstone’s 
and Holzinger’s separate researches. 

Nevertheless, the foregoing comparison well substantiates 
our contention as to the partially subjective nature of factorial 
analysis and especially of nomenclature in selecting titles for 
the ultimate factors segregated. The two lists just cited, for 
example, suggest greater divergence in results from these no- 
table experiments than sampling differences alone should pro- 
duce if the respective factorial traits were either primary or 
unitary, according to normal usage of those terms. But this 
common usage does not often apply to factorial entities. It is 
unfortunate, though only natural, considering the rather loose
manner in which these titular designations are bandied about
even by the specialists, that readers of educational literature
take them at face value. Yet why should they not? “Primary”
and “unitary” have an impressive, fundamental connotation
which their respective sponsors have not adequately qualified,
so far as the general public is concerned.


EMERGENCE OF THE THURSTONE PRIMARIES 
Whether certain of Thurstone’s primary mental abilities
(e.g., S and N) seem a priori to represent differential values
while others (W and M, for example) do not, they are all statistical
entities, isolated by the mathematical technique of factor
analysis. This procedure essentially depends upon study of
internal evidence—i.e., correlations among themselves of the
tests employed in these experiments—and not upon relationship
to external criteria, such as course grades in various scholastic
fields or occupational ratings. All factor analyses start
from, and are controlled by, the correlation matrix. This is the 
basic table of intercorrelations prevailing throughout the vari- 
ous measures employed. Whatever factors or traits can be 
drawn from this matrix by any method is thus predetermined 
in the original sampling of these measures and of the popula- 
tions to which they have been administered. The several prima- 
ries, whatever their eventual significance may prove to be, 
simply emerge from the analysis of these basic data by reason 
of their factorial independence. 
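
As a rough illustration of the starting point described above, the following sketch builds a table of intercorrelations from raw scores. It assumes ordinary product-moment correlation and uses invented scores for six students on three tests; the test names, the data and the helper functions are hypothetical and serve only to show what a correlation matrix is.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(scores):
    """Return (test names, square matrix of intercorrelations)."""
    names = list(scores)
    matrix = [[pearson_r(scores[a], scores[b]) for b in names] for a in names]
    return names, matrix


# Hypothetical scores of six students on three tests (illustrative only).
scores = {
    "verbal": [12, 15, 11, 18, 14, 16],
    "number": [30, 34, 29, 40, 31, 37],
    "spatial": [7, 5, 9, 6, 8, 4],
}

names, R = correlation_matrix(scores)
for name, row in zip(names, R):
    print(f"{name:>8}", [round(r, 2) for r in row])
```

Nothing outside this table enters the analysis: change the tests or the students and the matrix changes, and with it whatever factors can later be drawn from it.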

Unlike other attempted indices of differential capacity, these 
had neither been developed for specific purposes nor pointed 
toward any predetermined objectives. Instead of postulating 
certain abilities or aptitudes and then seeking objective means 
to appraise them, this method begins with a segregation of 
anonymous traits, initially regardless of their nature. By Thurs- 
tone’s centroid method, subsequent axial “rotations” through- 
out various theoretical planes as determined by associative 
“clusters” lead to appearance of successive factors (I, II, III
... VI, VII ...) until the basic data have, as it were, been
purged of significant residuals.
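
The first extraction step of a centroid analysis can be sketched as follows. This is a simplified illustration, not a reproduction of Thurstone's full procedure: it assumes unities in the diagonal (communality estimates are normally inserted there), it omits the sign reflections needed before later factors are taken out, and it does not attempt rotation. The small matrix `R` is hypothetical.

```python
from math import sqrt


def first_centroid_factor(R):
    """Return (loadings, residual matrix) for the first centroid factor.

    Each loading is the column sum of R divided by the square root of
    the grand total of all entries in R.
    """
    n = len(R)
    column_sums = [sum(R[i][j] for i in range(n)) for j in range(n)]
    total = sum(column_sums)
    if total <= 0:
        raise ValueError("sum of correlations must be positive")
    loadings = [t / sqrt(total) for t in column_sums]
    residual = [
        [R[i][j] - loadings[i] * loadings[j] for j in range(n)]
        for i in range(n)
    ]
    return loadings, residual


# Hypothetical intercorrelations among three tests (unities in the diagonal).
R = [
    [1.0, 0.6, 0.5],
    [0.6, 1.0, 0.4],
    [0.5, 0.4, 1.0],
]

loadings, residual = first_centroid_factor(R)
print("loadings:", [round(a, 3) for a in loadings])
for row in residual:
    print([round(r, 3) for r in row])
```

Each column of the residual matrix sums to zero once the first centroid has been removed, so some tests must be reflected in sign before a second factor can be extracted. Repeating the step until the residuals are negligible is the "purging" referred to above.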

Thurstone’s next step was to scrutinize those tests which reflected
highest saturation for the respective primary abilities
thus isolated and ask himself, so to speak: “What manner of
traits are these I have found? What sort of mental process
seems to distinguish one from another? What is common to
the several tests which are highly loaded with Factor I; or to
this other cluster which hangs together in relationship to Factor
II? What psychological meaning do these separate outcroppings
have?”² Chapter V of the first monograph, entitled
“Interpretation of the Factors,” gives a distinct impression of
subjectivity at this stage. For example, in discussing the second
group of associated measures, Thurstone states: “Our
problem is to identify, if possible, the psychological trait which
is common to these tests and which is absent from the thirty-seven
tests with negligible saturations for this factor. A hypothesis
which agrees with introspective study of the mental
operations essential in these tests is that the factor is essentially
perceptual in character.” (1938b, p. 80.)


2. Monograph No. 1 (p. 10). “When the primary axes have been discovered,
it is a problem of considerable psychological interest to ascertain what each
primary ability is like. This is done by inspecting the tests which are then
known to require the primary and the tests in which the primary is known to
be absent. In case there is uncertainty about the exact nature of a primary
factor, or in case two rival interpretations are made, it is possible by the
[...] the question on an experimental basis instead of
argument about rival classifications of personality as in the past.”
(Thurstone, 1938b, p. 10.)


It was through introspective analysis of the tests most clearly 
associated with the several factors that descriptive titles (Mem- 
ory, Verbal Comprehension, Word Fluency, Spatial Visualiz- 
ing, etc.) were selected, with which to christen the various 
primary mental abilities. Here is a good experimental instance 
of “letting the chips fall where they may” and subsequently 
looking for whatever patterns they form. There is, of course, 
nothing reprehensible about this procedure; it offers indeed a 
splendid example of pure scientific method. Moreover, the fact 
that no one yet seems quite sure of just what these intellective 
patterns are cannot be attributed to the shortcomings of any 
factorial system per se. 

Our criticisms in this respect are therefore specific rather 
than general, and deal especially with Thurstone’s positive defi- 
nition of certain factors bearing one or another generic label. 
The phrase, “clearly defined,” which he employs repeatedly in 
Psychometric Monograph No. 2 and elsewhere, conveys to most
people a distinct impression of exactness not in fact substantiated
when his published data are subjected to careful scrutiny.
It is true that subjectivity pertains more to the final stage of
deciding what functionally characterizes the associated tests,
and how to label them, than it does to the earlier statistical procedures
by which the basic factors were isolated. Even here personal
judgment enters in to affect the method of “rotating
axes.” To some factorial problems there may indeed be no
unique solution.³


Problem of Nomenclature 


The problem of nomenclature again demands attention: for 
example, one “clearly defined” factor is entitled Memory (M). 
Offhand, one might readily assume that M constitutes a primary 
ability (i.e., memory in general) ubiquitous throughout edu-
cational and vocational situations alike. Actually, Thurstone’s

3. Cf. Wolfle (1940, p. 39) regarding certain limitations of factor analysis:
“One is that the factorial situation is seldom unique; it is only one of several
equally possible solutions.”
memory factor is quite specialized in nature and admittedly
represents rote memory, which seems to have little practical 
significance for either educational or vocational guidance. 
While he briefly acknowledges this fact in the monographs, 
and even suggests that factorial analyses should be made of 
other assumed memory factors at successive educational levels, 
a thorough reading of his several publications is necessary for 
one to appreciate how restricted the factor designated M really 
is. For example, the restriction is not made clear in Thurstone’s
Report Form to individual students taking the Primary Mental 
Abilities Battery. This “Individual Report on Primary Mental 
Abilities Tests,” 1938 edition, refers to “the belief that a good 
memory is an ability independent of other mental powers” and 
tentatively defines factor M as “the ability to memorize.” Such 
generalization does not seem warranted by the nature of lim- 
ited measures used to isolate it, or by its rote nature. 


APPLYING DIFFERENT FACTORIAL TECHNIQUES TO 
THE SAME DATA 


The seemingly wide contrast in results obtained by Thur-
stone and Holzinger from their respective factorial studies was
attributed, on page 177, largely to sampling differences (be-
tween test batteries and populations alike) and to variations
in nomenclature (which are sometimes more apparent than
real). Certain of the disparate labels affixed by these investiga-
tors to their several products have, in a sense, counterparts
among the various trade names of commercially manufactured
wares. Despite “scientific” advertising claims and the radio
broadcasters’ tear-jerking interludes, there is little to choose
between one brand or another of cigarettes, hair tonic, beer or
[illegible]. Presumably, these are all “good goods” and warrant
con[illegible] even though none monopolizes in fact
the unique virtues claimed for it. That may also be true of dif-
ferent factorial methods when applied to the same raw material.

One important study, at least, indicates that the two divergent


4. “[illegible] study is now in progress involving twenty-four dif-
ferent [illegible] tests which were combined with tests for the other
[illegible]. [illegible] analysis of the results has not yet been completed, but the in-
dication [illegible] primaries will be found in addition to the
rote [illegible] factor that we have denoted M. . . . The factorial results of
[illegible] battery of [illegible] tests will be reported in a later publication.”
(Thurstone, 1941a, p. 6.)
techniques yield fairly similar results when applied to identical
basic data. Further analyses of this nature, directed toward 
reconciling rather than quibbling about the merits claimed for 
one or another factorial product or methodology, are needed. 

The study in question, by Holzinger and Harman (1938), 
analyzed Thurstone’s original Primary Mental Ability Tests 
by the Spearman-Holzinger modified bifactor technique earlier 
mentioned. Here was not the usual instance of contrasting two 
(or more) methods with rival claims resting upon different 
fundamental evidence. Holzinger and Harman simply, but no 
less significantly, applied another technique than Thurstone’s 
to the identical correlations already factored by his method 
and compared the findings. 

To a major extent these agreed with Thurstone’s, after 
reconciliation of certain functional terms, viz: number and 
arithmetic; perception and imagination; deduction and logi- 
cal reasoning; word fluency and completion. While not all ob- 
vious synonyms (and quite impossible to identify as even re- 
lated, from their letter symbols as reproduced in the following 
table), these different terms were indicated as essentially simi- 
lar by the respective test loadings of each analysis. 

Table 12 has been prepared to illustrate the operation of 
these two methods of factorial analysis when the same basic 
data are employed. It summarizes brief textual comments by 
the original authors with reference to tabular evidence which 
they also present. Since the latter is complex, giving various 
factor loadings in detail to support the characterizations as to 
degree of correspondence obtained, it is not reproduced here. 
That evidence however clearly substantiates the conclusions of 
Holzinger and Harman as to high agreement between the two 
methods of analysis with respect to six mental factors. 

It will be noted that Thurstone’s I and R factors are not ob- 
tained, even from his own data, by the Holzinger technique, 
which in turn yields two others (analogies and rhythm) “not 
represented in the [Thurstones’] Multiple Factor Analysis.”
There is little to suggest, statistically, that these represent dif- 
ferent brand names for functionally similar traits. 

As one might expect, Holzinger and Harman obtain a well- 
marked general factor because that is what the Spearman school 
first extracts from any set of correlations. While Thurstone 
has modified his original position to the extent of recognizing 
a “second-order general factor” in his later experiments, this is 
TABLE 12

Thurstone’s Factors Compared with Those of Holzinger and Harman

Thurstone’s Factors             Holzinger and Harman’s    Holzinger and Harman’s characterization
                                Factors                   of degree of agreement with
                                                          Thurstone’s Factors

Spatial Visualization      S    spatial             s     “almost perfect”
Number Facility            N    arithmetical        m     “perfect”
Memory                     M    memory              o     “remarkable agreement”
Verbal Comprehension       V    verbal              v     “remarkably close”
Perception                 P    imagination         i     “quite comparable”
  (or perceptual speed)
Deduction                  D    logical reasoning   l     “agrees perfectly in its conspicuous
                                                          loadings”
Word Fluency               W    completion          c     (no comment by the authors; low
                                                          partial correspondence indicated
                                                          from their tabular data)
Induction                  I                              “no counterpart”
Restricted Reasoning       R                              (no comment)
                                analogies           a     “not represented in the Multiple
                                rhythm              r     Factor Analysis”
                                general factor      u     “A formal difference in the two
                                                          analyses occurs in the case of the
                                                          general factor which we obtain and
                                                          which Professor Thurstone appar-
                                                          ently does not.”

Table 12 was adapted from Holzinger and Harman (1938, pp. 58-60). Note that the
[illegible] and their corresponding symbols in this analysis differ from
those in Holzinger’s Unitary Trait Study cited on page 176. Some of these varia-
tions in terminology and identification arouse a certain bewilderment as to how
unitary or “primary” these factors really are.


played down by his method⁵ of first milking the basic correla-
tions almost dry in his initial search for independent primaries.

5. Thurstone’s centroid method of factorial analysis per se does not bar the
identification of a first-order general factor, if the nature of tests employed
and their resultant intercorrelations clearly embrace it (Wright, 1939). Thur-
stone’s extensive range [illegible] utilized in his PMA studies were
however initially select[illegible] human ability. Hence [illegible]
his data derives from or does emerge at long last from [illegible]able
residuum of “primaries.”

The foregoing discussion regarding these two seemingly
opposed factorial methods of analysis indicates (a) that the
use of different population samples and test materials may lead
to widely divergent results, further complicated by subjectively
chosen variations in nomenclature and symbolism; and (b) that
more essentially similar (though not identical) findings are 
reached when the two opposed methods analyze the same raw 
materials of correlation.

These points are illustrated by the contrast between almost
complete divergence as set forth on p. 176, and the fairly close 
agreement reflected in Table 12. What differences there re- 
main can as well be accounted for by variations in the nature 
of testing materials employed as by any fundamental conflict 
between the factorial methods themselves. This point of view is 
supported by Thurstone’s recent comment (1945, p. 8): “Our
present interpretation is that there exists what we have called 
a second-order factor which is more fundamental than the pri- 
mary.” (Italics by the present writer.) 

If Thurstone should perchance designate this “more funda- 
mental” factor as primary, and the original primaries as of sec- 
ond order, the two opposed controversial theories would seem 
almost fully reconciled. To the layman in these matters, it now 
remains chiefly a question of whether one extracts the under- 
lying general factor initially and the specifics thereafter (as 
Spearman advocates) or follows Thurstone in first teasing out 
the differential factors and then recognizing predominance of 
a “more fundamental” residuum. Either approach leads to the 
conclusion that mental abilities comprise some sort of general 
intellective, plus various more specialized, factors. It need 
scarcely be added that great credit is due the Thurstones for 
their high scientific valor and integrity in pursuing a some-
what tortuous path to their present conclusions.


HOW AND WHY THE PRIMARIES VACILLATE 


It may now be appropriate to expand somewhat the earlier 
comment that even the primaries themselves “have not alto-
gether ‘stayed hitched.’” Not only have the number and the
designation of those earlier isolated undergone changes, but 
from inspection of testing materials and study of Thurstone’s 
monographs earlier cited it appears that somewhat different 
combinations have been successively employed to measure the 
same basic abilities. This situation is naturally explained by
the experimental nature of such a project and the desire to im-
prove each test composite (the group of instruments teamed
up for a given purpose) in the light of continued research.
Changes in the list of primaries identified, and in the composi-
tion of materials utilized for their determination, alike raise the
question: Are these Primary Mental Abilities really primary?
If they were—in the sense many people have been led to believe 
—even fairly wide variations among the successive experi- 
mental populations would scarcely account for their shifting 
alignment. The exercise of personal judgment in manipulat-
ing the basic data, in interpreting their significance and finally
in choosing nomenclature also has considerable influence upon
the end results.

Wolfle, in his admirable summary of factorial methods, states 
that Thomson, Thurstone and Tryon have “repeatedly criti- 
cized the naiveté of supposing that every factor necessarily 
represents an ultimate and unitary mental ability. None of the 
major students of factor analysis ever held such a view, but 
some of their critics have fallen into the easy error of accusing 
both Spearman and Thurstone of it because of the names they 
have given to their factors. Spearman's concept of g as the total 
fund of mental energy and Thurstone’s ‘primary traits’ and 
‘primary abilities’ are easily misinterpreted. The ordinary con- 
notations of the word ‘primary’ are such as to foster the notion 
that Thurstone has, or believes he has, isolated the basic and
ultimate causes of differences in ability.” (Wolfle, 1940, p. 26.) 
The foregoing rebuke no doubt applies to certain of the writer’s
criticisms above and elsewhere in this volume. However, the term
“naïveté” quite properly suggests that some who write and read
about these matters, or even attempt factorial analyses on
their own, are not “major students” of this technique and
have in fact rather justifiably “fallen into the easy error” men-
tioned. Ordinary connotations of such words as “primary,”
[illegible], “memory” and the like do persist, and when em-
ployed without warning for esoteric purposes they naturally
lead to misconceptions. What Thurstone, for example, tech-
nically means, and what all but those readers who understand
[illegible] double-talk think he means, are often quite at variance.

[illegible] the primaries themselves have not [illegible]
throughout these experiments is that sequential
groups react differently to various elements. Thurstone
([illegible] 8) indeed points out: “A psychological test does not
[illegible] in factorial composition; the factorial composition is,
of course, dependent on the subjects. This is another way of
stating the familiar principle that the validity of a test is not 
a fixed attribute of the test. It is a function of the criterion (the 
factors) and the population for which it is intended.” 

While this is an eminently sound comment, it seems question- 

able whether the “familiar principle” is nearly as well recog-
nized among educators generally, or even some psychometric
technicians, as it should be. Unfortunately, there is all too much 
evidence in the literature on educational measurement that 
many persons either fail to comprehend this important prin- 
ciple or neglect it by implication in such statements as, “The 
validity of Test X is thus and so, its reliability is .90, etc.,” as
if these were inherent characteristics of the instrument, regard- 
less of administrative conditions. 

These somewhat repetitive comments may explain why vari- 
ous changes have occurred in details of the PMA program since 
1938 and why further changes therein may be expected. The 
method of factorial analysis, when applied to educational 
measures, superficially appears so precise and mathematically 
derivative, so impersonal and for most readers so mysterious, 
that we have deliberately called attention here and in Chapter 
II (pp. 71-72) to its less exact aspects. At some stages of this
skillful game, personal judgment seems more wide-open and 
untrammeled by objective criteria, when factors are being iso- 
lated and primary mental ability tests chosen, than when apti- 
tude or other differential measures are put to the realistic test 
of demonstrating practical efficiency. 


Some Questions of Methodology 

It may seem inappropriate in a work of this nature, primar-
ily directed toward the evaluation of aptitude measures, to con-
sider technical aspects of the Primary Mental Abilities research 
in detail. Yet other questions arise. The Battery thus far devel- 
oped by Thurstone, and its revised successors to come, are all 
intended eventually to serve as both educational and vocational 
capacity tests. The foregoing sketchy outline of their evolution 
should indicate how, unlike other measures with essentially the
same objectives, these were initially produced before, rather
than in the course of, validation against external criteria. Be-
cause this process represents a relatively new method of select-
ing differential-aptitude materials, even more dependent than
its precursors upon statistical method, specific attention to the 
latter can hardly be avoided. 

Considering the scientific purity of Thurstone’s approach, 
the mathematically elaborate procedures demanded by factorial 
analysis and all of the attention given to refinement of these 
special techniques, it seems curious that more careful attention
was not initially paid to control of certain basic data underlying 
his research. It is axiomatic that no extension of statistical 
treatment can yield final results more quantitatively exact than 
the original measures from which they are derived. For ex-
ample, if individual course grades are reported in five-point
steps (60, 65, 70... 90, etc.), it is manifestly absurd to 
carry out a student's four-year average of such marks to sev- 
eral decimal points. Yet the writer has actually seen such aver- 
ages produced by overzealous registrars, seeking to be “accu- 
rate.” Without suggesting that Professor Thurstone would 
commit such an egregious error of judgment, one may still ask 
whether his analyses do not also suggest a meticulous precision 
not really justified (for several reasons) by the fundamental 
records. Spearman (1939) has pointed out that relatively high 
probable errors of Thurstone’s fundamental correlation co- 
efficients, as reported in the first monograph, cast doubt upon 
their significance.⁶ In view of the tremendous effort subse-
quently devoted to the complex problem of factoring these data,
it is surprising that they were not originally computed by a 
more careful method than that of tetrachoric correlations. 


Explanation of Tetrachorics 


To substantiate the foregoing comments and explain the 
nature of tetrachoric correlations as employed by Thurstone in 
his initial PMA study, a brief statistical interlude is necessary 
at this point. Because the procedure is seldom used in current 
educational analyses, only brief reference thereto was made in 
the chapter on statistics (p. 50). This method of comparing 
individual performance on each test with that on all others com-
prising a series (1, 2, 3, 4, ... etc.) is relatively simple, but
crude. It merely splits the variable on each axis into high and
low “dichotomies” and yields a coefficient calculated from the
proportion of cases falling within each quadrant of a scatter
plot, i.e., high on both measures or low on both (positive rela-
tionship); high on one but low on the other (negative relation-
ship).


(+) (+)                  (-) (+)
(-) (-)                  (+) (-)
Positive Correlation     Negative Correlation


6. Cf. also Traxler (1941b). 
Since relative degrees of superiority or inferiority within
each quadrant are not taken into consideration by this proce- 
dure, it obviously provides a rough appraisal at best throughout 
the total range of relationships involved. As quoted below, 
Thurstone discusses in his first monograph the reasons which 
led him to calculate all original correlations by the tetrachoric 
or “fourfold” method. At the time, he evidently regarded this 
as the most suitable procedure to follow. It must be remembered 
that, in such undertakings, the scientist is often limited by 
practical circumstances which may be decidedly less than ideal. 
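
To make the fourfold procedure concrete, the following Python sketch splits two sets of scores at arbitrary cut points, counts the cases in each quadrant, and converts the counts to a coefficient. The conversion used here is the cosine-pi approximation, r = cos(pi / (1 + sqrt(ad/bc))), which is an assumption chosen for brevity; Thurstone reports using facilitating tables, and their exact procedure is not reproduced. The function names, cut points and sample scores are all hypothetical.

```python
import math

def fourfold_counts(x, y, x_cut, y_cut):
    """Split both variables at a cut point and count cases per quadrant.

    Returns (a, b, c, d): a = high on both, b = high on x only,
    c = high on y only, d = low on both.
    """
    a = b = c = d = 0
    for xi, yi in zip(x, y):
        high_x, high_y = xi >= x_cut, yi >= y_cut
        if high_x and high_y:
            a += 1
        elif high_x:
            b += 1
        elif high_y:
            c += 1
        else:
            d += 1
    return a, b, c, d

def tetrachoric_cos_pi(a, b, c, d):
    """Cosine-pi approximation to the tetrachoric coefficient (assumed here)."""
    if a * d == 0 and b * c == 0:
        raise ValueError("fourfold table is degenerate")
    if b * c == 0:
        return 1.0   # an off-diagonal cell is empty: limit of the formula
    if a * d == 0:
        return -1.0  # a diagonal cell is empty: limit of the formula
    return math.cos(math.pi / (1.0 + math.sqrt((a * d) / (b * c))))

# Hypothetical scores. The "wider" list differs only in how far two scores
# lie from the cut, so the quadrant counts and the coefficient are unchanged.
test_1 = [12, 15, 31, 34, 14, 33]
test_2 = [40, 45, 61, 66, 63, 44]
test_1_wider = [1, 15, 31, 99, 14, 33]

counts = fourfold_counts(test_1, test_2, 20, 50)
counts_wider = fourfold_counts(test_1_wider, test_2, 20, 50)
r_t = tetrachoric_cos_pi(*counts)
```

Because only quadrant membership enters the calculation, the two sets of scores above yield the same coefficient, which is the loss of information described in the text.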

Because of the great number⁷ of basic correlations in Thur-
stone’s first large-scale PMA project, “It was decided to reduce
the computational labor by using tetrachoric correlation co- 
efficients instead of product-moment coefficients. . . . In using 
the tetrachoric coefficient, we are sacrificing some accuracy, but 
we are not introducing any new assumptions into the factorial 
analysis. If the raw scores were allowed to enter directly into 
the correlation coefficients, we should have incorrect values in 
case the raw distributions deviate from normality, since we as- 
sume that these distributions are normal. . . . The computa- 
tion of the tetrachoric coefficients was made by means of facil- 
itating tables by which each coefficient can be determined in a 
few minutes.” (Thurstone, 1938b, pp. 58-59.)

However, in the analysis of his next experiment (employing
several more tests and therefore increasing still further the
number of coefficients involved), Thurstone was able to utilize
a new product-moment correlation machine. Hence basic inter-
relations for his later studies were computed by the Pearson r
method, though only after raw test scores had been reduced to
single digits.⁸ Here earlier appraisal of correlations by the
tetrachoric procedure was first improved upon through mechan-
ical aid in calculating more reliable coefficients; but it was off-
set in turn, so far as accuracy of the final values is concerned,
by this also rough (single-digit) scoring. Which procedure in-
troduces the larger range of error in basic correlation data is
questionable, although it seems rather high in either case. For 
the reader interested in reliability coefficients and intercorrela- 


7. [illegible]

8. “For convenience of handling with the tabulating-machine [illegible] the
raw scores were transmuted into single-digit scores from which the Pearson
product-moment correlation coefficients were computed. With [sixty-three] vari-
ables there were 1,953 Pearson correlation coefficients.” (Thurstone, 1941a, pp.
14-15.)
tions among the tests, Table 13 has been prepared. Approx-
imately one-third of the tests had reliabilities between .95 and
.98, while another third ranged between .85 and .94. The spread
of intercorrelations runs from —.25 to .84; somewhat over two- 
fifths were reported as .35 or higher. 
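
The counts of coefficients quoted in this discussion follow from the number of distinct pairs among n tests, n(n - 1)/2. A minimal check in Python, with hypothetical function names: fifty-seven tests give the 1,596 intercorrelations totaled in Table 13, and 1,953 coefficients would correspond to sixty-three variables. The latter figure is an inference from the arithmetic, not a number legible in the quoted footnote.

```python
def n_pairs(n_tests):
    """Number of distinct test pairs, hence of intercorrelations."""
    return n_tests * (n_tests - 1) // 2

def n_tests_for(pairs):
    """Smallest n whose pair count equals `pairs`, or None if no n fits."""
    n = 2
    while n_pairs(n) < pairs:
        n += 1
    return n if n_pairs(n) == pairs else None

first_study = n_pairs(57)        # fifty-seven-test experiment
later_study = n_tests_for(1953)  # variables implied by 1,953 coefficients
```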


TABLE 13

Distribution of Reliability Coefficients and Intercorrelations
for Fifty-Seven Tests

Correlation        Frequency Distribution    Frequency Distribution
Intervals          of Reliability            of Intercorrelations
From     To        Coefficients              (Tetrachoric Coefficients)

 .95     .98            19
 .85     .94            20
 .75     .84             6                          8
 .65     .74             7                         28
 .55     .64             1                        117
 .45     .54             1                        208
 .35     .44                                      288
 .25     .34                                      394
 .15     .24                                      304
 .05     .14                                      177
-.05     .04                                       52
-.15    -.04                                       18
-.25    -.14                                        2
Not Reported             3
Total                   57                      1,596

The above data were assembled from pp. 60, 113 and 114 of Monograph No. 1
(Thurstone, 1938b) based upon the fifty-seven-test experiment using 240 college stu-
dents as subjects. The reliability coefficients had been estimated by the tetrachoric
correlation of scores derived from odd- and even-numbered items, each series sub-
divided so as jointly to comprise a fourfold table.


PRIMARIES VERSUS A GENERAL FACTOR


With respect to internal validity or stability of the Primary 
Mental Abilities themselves and of tests developed for their ap- 
praisal, one always challenging problem in the measurement of 
human abilities and individual differences for readiness-to-learn 
is continually present. This revolves about the nature and
relative importance of that general intellective factor which, 
though already discussed, requires further attention. 

Thurstone’s earlier work was posited on the multifactor as-
sumption; i.e., that mental performance could be wholly ex-
plained by various combinations of specific (virtually uncor-
related) abilities, with no permeating general factor like
Spearman’s g. His second monograph, however, contains inter-
esting observations on the apparent discovery of a “second-
order general factor.” To quote again: “It makes its appear-
ance, not as a separate factor, but as a factor inherent in the
primaries and their correlations. If further studies of the pri-
mary mental abilities of children should reveal this general fac-
tor, it will sustain Spearman’s contention that there exists a
general intellective factor. . . . We have not been able to find
in these data a general factor that is distinct from the primary 
factors, but the second-order general factor should be of as 
much psychological interest as the more frequently postulated, 
independent general factor of Spearman. Our findings seem to 
support Spearman’s claim for a general intellective factor, but 
he has been so critical of our work on the primary mental abil- 
ities that it is uncertain whether he would accept our support 
for a general intellective factor.” (Thurstone, 1941a, p. 26.) 

The argument and statistical data are much too involved for 
full discussion here, but the excerpts cited obviously have basic 
implications. Thurstone’s hypothesis seems to be that primary 
abilities are largely uncorrelated among adults (college fresh- 
men) but considerably less so among eighth-grade children. He 
further states, on the page just cited: “It is now an interesting 
question to determine whether the correlations among primary 
abilities of still younger children will reveal perhaps even more 
strongly, a second-order general factor.” 

Thurstone’s rather plaintive query as to whether Spearman 
would, so to speak, meet him halfway in compromise of their 
long and sometimes acrimonious debate is interesting. Cattell 
(1948, pp. 169-170) treats this situation with notable grace: 
“The convergence of Spearman and Thurstone is now com- 
plete, barring certain diplomatic formalities. Spearman finds 
certain group factors, and Thurstone has a general factor. But 
Spearman introduces his group factors to the reader with a cold 
and perfunctory politeness, while Thurstone’s general factor 
is only permitted to enter society as a ‘second-order factor’
after the ‘primary abilities’ have made off with all of the actual
test variances.”⁹ This argument we have earlier paraphrased.


9. Cf. also the earlier reference, cited on p. 182, to Ruth Wright’s (1939)
demonstration that a “first-order general factor” can be obtained by Thur-
stone’s method of analysis. This might indeed be regarded as axiomatic, pro-


The lay reader will be more concerned with agreement be-
tween these authorities that some sort of general factor does
emerge from both methods than whether—in Thurstone’s words 
again—it is “independent of the primaries” or a “factor operat- 
ing through correlated primaries.” Indeed his finding that the 
latter, after all, may be correlated seems to mark quite a change 
from his initial contention that these abilities are essentially in- 
dependent and that correlations found between measures 
thereof are perforce attributable to imperfections in the testing 
mechanism itself. 

Thurstone’s comments quoted above with specific reference 
to relationship among primaries for eighth-grade children do 
not indicate its degree. Hence Table 14 has been adapted from 
his data to summarize the extent of intercorrelation prevailing 
in the three experiments reported in Psychometric Monographs 
Nos. 1 and 2. Original intercorrelations (Section I of Table 14) 
among the primaries range from —.18 to .55. The supposedly 
pure primaries (Section II of Table 14) seem to correlate from 
.14 to .84 with the common general factor. Section III shows
the slight amount of relationship remaining between primary 
factors after the degree of correlation attributable to a common 
element (or general factor) has been extracted. 

Another way of describing these coefficients is to consider 
them as estimates of the relationship to be expected between the 
separate factors if there had been no common element (instead 
of their being partially interdependent through some g factor). 
It is only after influence of this general factor has been removed 
that intercorrelations among the primaries show real independ- 
ence, as indicated by their range from —.10 to .11. This is 
merely another way of stating that “the single factor accounts 
for most of the correlations between the primary factors.” 
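
The relation between Sections II and III can be sketched numerically. If each primary i correlates g_i with a single general factor, that factor alone would produce a correlation of g_i * g_j between primaries i and j, and what remains is the residual r_ij - g_i * g_j. The Python sketch below uses hypothetical loadings and intercorrelations, not Thurstone's figures, and it assumes that Section III reports residuals of this form. A partial correlation, which further divides by sqrt((1 - g_i^2)(1 - g_j^2)), is included for comparison.

```python
import math

def residual(r_ij, g_i, g_j):
    """Correlation left over after subtracting what the general factor implies."""
    return r_ij - g_i * g_j

def partial(r_ij, g_i, g_j):
    """Partial correlation of i and j with the general factor held constant."""
    return residual(r_ij, g_i, g_j) / math.sqrt((1 - g_i**2) * (1 - g_j**2))

# Hypothetical loadings of three primaries on the general factor.
loadings = {"V": 0.8, "W": 0.7, "N": 0.5}

# Hypothetical observed intercorrelations, close to the products of loadings.
observed = {("V", "W"): 0.58, ("V", "N"): 0.38, ("W", "N"): 0.36}

# Residual for each pair, rounded to two decimals as in the tables.
residuals = {
    pair: round(residual(r, loadings[pair[0]], loadings[pair[1]]), 2)
    for pair, r in observed.items()
}
```

With these invented values the residuals are all close to zero, which is the pattern the text describes: sizable intercorrelations among the primaries that largely vanish once a single common factor is allowed for.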

[illegible] of a general factor in Thurstone’s data for
eighth-grade children is clearly apparent. It seems to be more
closely related to Factors W, V and R (or I) than to the other
primaries. This is evidenced by the relatively high correlation
coefficients, ranging from .62 to .84 (see Table V, Appendix B),
between these factors and g. Although he may still not accept
[illegible], a general factor of some kind has been found which
plays an important part in the composition of primaries thus
far isolated. But the nature of g, however identified, remains


vided one began with search for a general factor, as Holzinger and Harman
presumably did in their analysis of Thurstone’s basic data, reported on
[illegible].
TABLE 14

Summary of Correlations Involving the Six Primaries and the
General Factor

                                                     Average        Range of
                                                     Coefficient    Coefficients

I.   Intercorrelations among the primary factors
     a. Fifty-seven-test experiment with 240
        college students.                              .02          -.18 to .18
     b. Sixty-test experiment with 710 eighth-
        grade students.                                .25           .08 to .43
     c. Twenty-one-test experiment with 437
        eighth-grade students.                         .36           .15 to .55
II.  Correlations between the general factor and
     primary factors
     d. Sixty-test experiment with 710 eighth-
        grade students.                                .50           .14 to .72
     e. Twenty-one-test experiment with 437
        eighth-grade students.                         .60           .34 to .84
III. Correlations among primary factors after
     removal of the influence of the general
     factor
     f. Sixty-test experiment with 710 eighth-
        grade students.                                .00          -.08 to .11
     g. Twenty-one-test experiment with 437
        eighth-grade students.                         .00          -.10 to .10

The above data were abstracted from Table 8, p. 100, Primary Mental Abilities
(Thurstone, 1938b), and from Table 7, p. 25, and Table 13, p. 37, of Factorial Studies of
Intelligence (Thurstone, 1941a). A more complete and detailed analysis, substantiating
the table above, will be found in Appendix B.


something of a mystery. Thurstone’s own words well describe 
the situation: 

“This finding raises the interesting question whether a unique 
general factor can be determined. Its interpretation here would 
be that the primary mental abilities are correlated by a general 
factor which operates through each of the primaries. Each of 
the primary factors can be regarded as a composite of an in- 
dependent primary factor and a general factor which it shares 
with other primary factors. The psychological interpretation 
of the general factor must be only tentative at the present time. 
We must await much data from further investigation to deter- 
mine whether the second-order general factor is maintained in 
repeated experiments, whether it can be found also for adult
subjects, and what psychological or physiological interpreta- 
tion may be given to it.” (Monograph No. 2.) 


New Aspects of the Primary Mental Abilities Research 


On the whole, Thurstone seems to have made a volte-face 
more striking perhaps to others than to himself: e.g., “We have 
not found any occasion to take sides as regards the existence 
of a general intellective factor.” (Idem, 1941a, p. 26.) That 
may well be. Yet readers of his earlier work, perhaps errone-
ously, have scarcely gained such an impression. Another change
of attitude evidenced in the second monograph pertains to the 
recognition of response-speed as in some sense a mental factor. 
This point warrants further consideration since (at least in the 
present writer’s opinion) it might account in part for progres- 
sive disclosure of general versus differential ability, as age 
ranges are lowered. Collateral evidence to that effect appeared 
in the preceding chapter, where overspeeding was suggested as 
one reason for somewhat unsatisfactory differential prognosis 
afforded by the Yale Aptitude Battery at secondary school 
as compared with college freshman levels. 

In passages already cited, Thurstone refers to other re- 
searches in progress or contemplated (vide 1944). These may 
well lead to increased discrimination by the PMA tests, espe- 
cially since he has now accepted both the possibility of a gen- 
eral factor (according to the evidence in Table 14, quite highly 
correlated with most of the independent ones) and also the im-
portance of perceptual speed.¹⁰ Thurstone also indicates that
“a single average index of mental endowment can easily be ob- 
tained by taking the average of the six measures on the profile.” 
(Idem, 1941a, p. 8.) However, he naturally favors use of the 
separate measures expressed in profile form as superior to a
general intelligence index, since escape from domination by the 
latter concept is a particular aim of modern differential prog- 
nosis. In the light of validation studies later discussed in the 
present chapter, it is precisely on this point that doubt arises 
as to whether Thurstone’s battery has yet achieved the intended 
purpose.
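
The contrast between a single average index and the separate profile may be illustrated by a minimal sketch. The scores below are hypothetical standard scores invented for illustration, and the function names are likewise assumptions; none of this is drawn from Thurstone's data. Two pupils with identical averages can present quite different profiles:

```python
from statistics import mean

# Hypothetical standard scores on the six 1941 composites (illustrative only).
profile_a = {"N": 62, "V": 41, "S": 65, "M": 50, "R": 58, "W": 42}
profile_b = {"N": 44, "V": 63, "S": 40, "M": 52, "R": 55, "W": 64}

def average_index(profile):
    """Single index: the plain mean of the six composite scores."""
    return mean(profile.values())

def profile_spread(profile):
    """Difference between the highest and lowest composite in a profile."""
    return max(profile.values()) - min(profile.values())

for name, profile in (("A", profile_a), ("B", profile_b)):
    # The averages coincide although the high and low points fall on different factors.
    print(name, average_index(profile), profile_spread(profile))
```

The average conceals exactly the differences which a differential battery is intended to reveal; whether the profile itself is dependable is a separate question.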

Apparent shortcomings thus far may follow necessarily from 
the nature of primary abilities, in so far as certain of them are 


10. Cf. Thurstone (1944) and (1945, p. 7), where the following statement is
amplified: “Five of the factors revealed in the perceptual functions are
concerned with speed.”

demanded in some degree by all intellectual pursuits. Theoretically
independent they may be, but functionally they remain
interwoven, operating as teams in execution of any but the 
simplest mental tasks. When these are reduced to such a degree 
that word fluency can be distinguished from verbal comprehen- 
sion, or several different types of memory isolated, the testing 
situation ceases to be realistic in educational or practical terms. 
Yet that same possibility was advanced, seemingly as desirable, 
in the following statement: “What we have called the ‘complex- 
ity’ of each test should be reduced. It seems likely that such 
improvements in psychological tests will mark a new line of 
development. Instead of improving a composite test by raising 
its correlation with some equally complex practical criterion, 
such as academic scholarship, the tests will be improved by 
making them relatively pure measures of the primary abilities. 
In general, this will make the tests look simpler, and in some 
cases the test will appear to be remote from the practical activities
that psychological tests are sometimes made to simulate.”
(Monograph No. 1, pp. 92-93.)

If the present chapter seems unduly discursive or compli- 
cated, this is at least partially due to the writer's difficulty in 
adjusting to successive changes in Thurstonian dialectic. His 
latest published statement, at the present time of writing, in- 
cludes the following paragraph: 

“It is an old observation about intellectual tasks of all kinds
that the correlations are all positive. If two widely divergent 
mental tasks are considered, the correlations between them may 
be low but the association is always positive. In fact no negative 
correlations have ever been found for intellectual tasks. There is 
a rather common misconception about these relations. It is not 
infrequently asserted that those who are superior in one intel- 
lectual task are somehow inferior in some other intellectual task. 
Among students such an impression is not uncommon as regards 
linguistic and scientific abilities. The fact is that no negative 
correlations for performance in school subjects or in intellectual 
tests have ever been found.” (Thurstone, 1945, p. 4.)

These highly significant remarks seem rather at variance with 
the former emphasis (in Monograph No. 1 and earlier or con- 
temporary papers) upon virtual independence of certain mental 
tasks and particularly of the factors which these employ. 
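
One way to see how uniformly positive yet often low correlations can arise is to suppose that every test draws, in some degree, upon a single common factor. The sketch below assumes such a one-factor model, in which the correlation between two tests equals the product of their loadings. The loadings are hypothetical, and the model is a deliberate simplification, not Thurstone's multiple-factor method:

```python
from itertools import combinations

# Hypothetical loadings of six tests on one general factor (illustrative only).
general_loadings = {"N": 0.6, "V": 0.7, "S": 0.4, "M": 0.45, "R": 0.8, "W": 0.6}

def implied_correlations(loadings):
    """Correlations implied when tests share only one common factor: r_ij = g_i * g_j."""
    return {(a, b): loadings[a] * loadings[b] for a, b in combinations(loadings, 2)}

correlations = implied_correlations(general_loadings)
lowest = min(correlations.values())
highest = max(correlations.values())

# Every implied coefficient is positive, though some are quite low.
print(round(lowest, 2), round(highest, 2))
```

Under these assumed loadings no pair of tests can correlate negatively, which accords with the observation quoted above, while the weakest pairs still look nearly independent.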

THURSTONE’S EXPERIMENTAL PRIMARY MENTAL
ABILITIES BATTERY 

Before proceeding further to discuss Thurstone’s important 
research on primary mental factors or abilities as set forth in 
his two monographs and elsewhere, it is advisable briefly to 
describe the nature of his experimental battery (1938c). This 
consisted of sixteen tests, yielding seven differential factor 
scores. The titles of these separate tests, time limits for each, 
and the manner in which they were variously teamed to produce 
composite indices of the several factors are given in Table 15, 


TABLE 15

Composition of the Tests for Primary Mental Abilities

1938 Experimental Edition (16 Tests)             | 1941 Edition for Ages 11 to 17 (17 Tests)
Symbol and                           Time Limits | Symbol and                          Time Limits
Factor         Tests                 (Minutes)*  | Factor           Tests              (Minutes)*
N—Number       Addition              3-7         | N—Number         Addition           3-6
               Multiplication        2-7         |                  Multiplication     3-6
                                                 |                  Three-Higher       5-6
V—Verbal       Completion            4-5         | V—Verbal         Completion         3-6
               Same-Opposite         2-5         |                  Sentences          3-5
                                                 |                  Vocabulary         3-4
S—Space        Cards                 6-19        | S—Space          Cards              6-5
               Figures               4-11        |                  Figures            6-5
                                                 |                  Flags              10-5
M—Memory       Initials              5-4         | M—Memory         First Names        3-13
               Word-Number           5-12        |                  Word-Number        3-12
I—Induction    Letter Grouping       6-8         | R—Reasoning      Letter Grouping    7-4
               Marks                 10-8        |                  Letter Series      6-6
               Number Patterns       6-8         |                  Pedigrees          5-6
D—Reasoning    Arithmetic            2-20        | W—Word Fluency   First Letters      3-5
               Mechanical Movements  3-15        |                  Four Letter Words  3-4
               Number Series         4-10        |                  Suffixes           3-4
P—Perception   Identical Forms       3-8         |
               Verbal Enumerations   3-6         |
Total Time                           68-153      | Total Time                          75-102

Data for this table were abstracted from Thurstone (1938c, Table 1, p. 4, and 1941b,
Table 1, p. 8).

* The first entry shows time limit for the fore-exercise (practice period) and the
second, time limit for the test proper.

where they are compared with analogous elements of his later
(1941) Test Battery. Estimated reliability and validity of 
these batteries is shown in Table 16. At this point, it will suffice 
to state that individual factor scores were originally reported as 
profiles in the following order (Op. cit. 1938a):


P Perception
N Number Facility
V Verbal Comprehension
S Spatial Visualization
M Memory
I Induction
D Deduction


It is difficult to understand just what significance the fore- 
going serial order has. Unlike that for aptitude test profiles 
illustrated in the preceding chapter, it does not “read from left 
to right” in terms of meaningful, generally recognized areas,
whether educational or vocational. Neither does it accord with 
the successive revelation of factors, as described in Monograph 
No. 1. Perhaps, because these basic factors are assumed to be 
independent, the order is deliberately random and nonassociated
with familiar (educational or other) patterns.


TABLE 16

Estimated Reliability and Validity of the Composites

1938 Experimental Edition (16 Tests)                  | 1941 Edition for Ages 11 to 17 (17 Tests)
Symbol and     Estimated Reliability  Estimated       | Symbol and        Estimated Reliability  Estimated
Factor         (Split-half method;    Validity*       | Factor            (Split-half method;    Validity
(Composite     H.S. Seniors)          (H.S. Seniors)  | (Composite        Grades 6-12)           (Grades 6-12)
Score)                                                | Score)
N—Number       .99                    .63             | N—Number          .96 to .98             .90
V—Verbal       .96                    .68             | V—Verbal Meaning  .95 to .97             .97
S—Space        .99                    [illegible]     | S—Space           .96 to .98             .9[?]
M—Memory       .99                    .68             | M—Memory          .63 to .82             .79
I—Induction    .87                    .51             | R—Reasoning       .96 to .97             .90
D—Reasoning    .84                    .58             | W—Word Fluency    Not obtainable         .91
P—Perception   .99                    .65             |

Data for this table were abstracted from Thurstone (1938c, Table 1, p. 4 and 1941b,
Tables 1, 9, 12, pp. 8, 29, 30).

* “Estimated validity” is a theoretical correlation between the composite score
(based on the two or three specific tests which make up the composite) and a
hypothetically “pure” primary factor. This has no relation to external criteria.
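
Split-half coefficients such as those in Table 16 are customarily stepped up to full test length by the Spearman-Brown formula. Whether Thurstone applied exactly this correction is not stated in the material cited here, so the following is offered only as a sketch of the usual procedure; the function name and the sample value of .50 are illustrative assumptions:

```python
def spearman_brown(part_r, length_factor=2.0):
    """Reliability of a test lengthened by length_factor, given the
    correlation between comparable parts (Spearman-Brown formula)."""
    return length_factor * part_r / (1.0 + (length_factor - 1.0) * part_r)

# A hypothetical correlation of .50 between two halves of a test
# implies a higher reliability for the whole test.
half_r = 0.50
full_r = spearman_brown(half_r)
print(round(full_r, 3))
```

The same relation, read in reverse, shows why a composite built from fewer tests must be expected to measure its factor less exactly.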

Before proceeding with detailed consideration of Thurstone’s
experimental data and results, it is again necessary to discuss 
the problem of nomenclature. Thurstone has variously employed
such terms as “primaries,” “primary factors,” “battery”
or “composites” to mean somewhat different things, not
always clearly distinct, as severally relating to his basic test 
series and to the more limited group of resultant “service” 
measures. That is only natural, where sequential reports deal 
with separate aspects of a total long-range investigation. 

The remarks above, therefore, are in no sense meant as dis- 
paraging Thurstone’s terminology, but merely indicate the 
complications which ensue when pertinent excerpts are neces- 
sarily cited outside of their full context. Even the word “experi- 
mental,” safeguarding his first PMA Battery (1938 edition) 
offered for general trial, has a different connotation there than 
it does when applied to the underlying data from which the 
battery was assembled. The fact that earlier composite PMA 
measures were indeed regarded as experimental is proven by 
changes in the nature of Thurstone’s subsequent battery. He 
described plans for this 1941 edition as follows: “It is our purpose
to make available, as soon as possible, a battery of psychological
tests so designed that there will be three tests for each
of six primary mental abilities. Each set of three tests will be 
called a ‘composite test,’ and there will be one composite for each 
of the primary mental abilities. The tests for each primary are 
self-contained, so that each primary factor may be appraised 
independently of the others.” (Thurstone, 1941a, p. 8.) In
stating that identical terms have somewhat different meanings 
according to the total setting in which they appear, we merely 
wish to emphasize the importance of specific notations. Thus, 
in comments drawn from either monograph, “battery” refers
to the entire assemblage of tests (e.g., 57 initially) whose inter- 
correlations, when factored by Thurstone’s method, led to 
emergence of his mental primaries. 

From those data a smaller group of sixteen tests was
tentatively selected for practical trial and became the first experimental
battery (Thurstone, 1938c). Here the meaning of
the terms, as employed for example in the student-report
form, is altered by their new context—as indeed is functioning
of the sixteen measures themselves, when transplanted from 
earlier surroundings—i.e. the garden of 57 varieties in which 
their respective factor loadings originally grew. Composite 
scores yielded by either the 1938 or 1941 PMA Battery (using
that term now in its restricted service connotation), after re- 
analysis in their new setting, were more highly interrelated than 
the same elements had been earlier.¹¹

Thurstone predicted this outcome in advance and has repeatedly
said (a) that primaries were a demonstrable outcrop
only of particular testing situations and (b) that the fewer 
instruments used (in whatever combinations) to measure these 
primary factors, the less exact their resultant appraisal would 
be. Many readers of the literature on this topic nevertheless 
seem neither to grasp his cautionary injunctions nor to recog- 
nize the wide difference between theoretically independent men- 
tal factors and the much less pure, composite PMA test scores, 
which also bear a factorial label. 

A single booklet edition of the PMA battery has been pub- 
lished and distributed by Science Research Associates of Chi- 
cago under special arrangement with The American Council on 
Education. It is entitled The Chicago Tests of Primary Mental
Abilities (Single Booklet Edition for Ages 11 to 17). This
represents a shortened form of the previous edition, five tests 
having been eliminated from the 1941 battery, leaving the re- 
maining twelve tests completely unaltered. According to the 
Manual of Instructions (Thurstone, 1943c) scores on the same 
six factors, N, V, S, M, R and W, are obtained in a little less 
than two hours in contrast to the nearly three hours previously 
required. Over forty per cent of the working time is devoted to 
practice periods. The Manual contains no new data on either 
reliability or validity. For these reasons, discussion of the PMA 
battery in this chapter is limited to the basic 1938 and revised
1941 editions. 

PRELIMINARY REFERENCE TO VALIDATION OF 
THE PMA TESTS 

For guidance purposes, our major concern is with validity 
of the Primary Mental Abilities Battery as a testing mecha- 
nism, in comparison with that of the abilities themselves as fac- 
torial independencies or psychological entities. We have said 
that Thurstone’s position in some respects has been misunder- 
stood: e.g., that he does not consider the tests thus far developed 
sufficiently pure to measure the factors ideally. Indeed, as will 
appear in later discussion, they fall in varying degrees considerably
short of this ideal. The PMA Battery enjoyed a considerable

11. Cf. Thurstone (1941b, Table 10, p. 29 and 1941a, Tables 7 and 13, pp. 25
and 37). This trend was summarized in Section I of Table 14, page 191, of the
present chapter.

build-up among educators in advance of publication;
hence the results obtained from its initial trials, though not
coming up to certain rather naive expectations, have perhaps 
been too severely criticized. In part, such criticism represents 
a justifiable reaction against ultrafavorable analyses of data ob- 
tained from early trials of the battery. For these too sanguine 
expectations or claims its authors were by no means wholly re- 
sponsible: indeed they consider even the present six-factor 
forms as still experimental. 

Thurstone’s attitude in this respect is clearly set forth in his
“Individual Report on Primary Mental Abilities Test,” the
blank utilized for recording student performance. In discussing
each of the seven¹² factors on this individual report form he
carefully avoids any implication that they can yet be considered
valid for educational counseling purposes. To quote briefly
therefrom: “The mental abilities are probably very numerous.
It will require many years of research to discover most of the
important abilities and to develop tests for them. Seven¹² mental
abilities which have been isolated to date are represented in the 
tests that you have taken. Your individual ratings on these 
seven factors are shown in the profile in this folder. . . . It is 
too early in these investigations to make any definite statements 
about the particular combinations of abilities that are called 
for by each vocation or to attempt to make individual vocational 
prognoses by these scores.” (Thurstone, 1938a, pp. 1-2.)

Elsewhere, in referring to this 1938 experimental edition,
Thurstone states: “An experimental edition of tests for seven 
primary mental abilities was made available in response to a 
rather general interest in the problem of isolating mental abil- 
ities. We made the stipulation that these tests were not to be 
distributed as service tests, since they were experimental, and 
the first edition of the test forms was so designated. A number of 
improvements have been made in some of the tests since the ex- 
perimental edition was printed, so that the first edition will 
soon be revised.” (Monograph No. 2, pp. 1-2.)

As one step already noted in this revision, the number of 
abilities which the PMA tests now measure with considerable 
assurance has been reduced to six in the 1941 and 1943 edi- 
tions. Despite the disarming admission just quoted from the sec- 
ond monograph, Thurstone had earlier gone so far as to say: 

“The factors that have been identified in this study are nearly
uncorrelated. By this we mean that inter-correlations are near

12. Later reduced to six, as already noted.

zero, with one conspicuous exception . . . visualizing and
number. This inter-correlation is about .40. . . . By these tests 
it is now possible to describe each individual in terms of at least 
seven indices which should replace the intelligence quotient, 
mental age and other gross scores of general intelligence." 
(1936, p. 133.)

Therefore undue hopes as to what the experimental battery
would accomplish cannot be attributed entirely to other people's 
optimism. 


Comparison of the Yale Educational Aptitude Test Battery 
with That of Primary Mental Abilities 


Although the two are by no means alike in nature, the PMA 
and Yale batteries have analogous objectives in respect to dif- 
ferential measurement and guidance. Hence a brief comparison 
between them seems appropriate at this point. 

Following factor analysis of many diverse, individually short 
tests, certain of these were combined to form Thurstone’s series, 
its purpose being to measure theoretically quite disparate or 
independent mental traits. The Yale series was developed more 
ad hoc, with the admittedly utilitarian aim of trying to measure 
relative aptitudes for three broad areas of study in which the 
upper undergraduate schools of Yale University offer diver- 
gent educational opportunities. These may roughly be desig- 
nated as the liberal arts or social sciences, pure science or mathe- 
matics and applied science or engineering. Some composite 
parts of Thurstone’s Battery (e.g., Number, Verbal and Space, 
as earlier described) seem from their titles to have considerable 
similarity to certain elements in the Yale Battery; specifically, 
SAT (Verbal), MAT (Mathematical) and Spatial Relations. 
The other Thurstone composites, Memory, Reasoning and 
Word Fluency, have no direct counterparts in the Yale Battery, 
whose remaining four tests are Artificial Language, Verbal 
Reasoning, Quantitative Reasoning and Mechanical Ingenuity. 
The latter includes, by permission from Professor Thurstone, 
a few Mechanical Movement items employed in his earlier in- 
vestigations and now represented in his PMA series. 

Bearing these similarities and differences in mind, Table 17 
has been prepared showing the intercorrelations characterizing 
each battery. The range for PMA composites is (based on 1,000 


13. Since this was written, a reorganization of the undergraduate schools at
Yale has been effected; consequently their future “line-up” will differ somewhat
from that herein represented.

subjects, age 10 to 18) from .18 to .59; their average, .36. Intercorrelations
for the Yale Battery (based on 856 Yale Freshmen,
Class of 1944) range from .19 to .64, the average being
.41. As might be expected from the procedure and the broad
rather than pure objectives, Yale intercorrelations are somewhat
higher than those reported by Thurstone. In a sense, the
criterion which Thurstone used to develop his battery was the 
minimization of intercorrelations, while the Yale criterion was 
the reliability of separate tests and their external validity with 
respect to differential scholastic records of college freshmen. 
Overlapping among criteria in the Yale situation (grades in 


TABLE 17

Intercorrelations Among Thurstone’s PMA Composites Compared
with Intercorrelations Among Yale Tests

A. THURSTONE’S PMA BATTERY*

       N      W      V      S      M
N
W     .41
V     .40    .54
S     [three entries illegible]
M     .31    .36    .35    .18
R     .53    .49    .59    .29    .39

Range of Correlations: .18 to .59; Average: .36.

B. YALE BATTERY†

            I      II     III    IV     V      VI
           (SAT)  (ALT)  (VRT)  (QRT)  (MAT)  (SVT)
I    SAT
II   ALT    .41
III  VRT    .64    .41
IV   QRT    .30    .49    .51
V    MAT    .20    .32    .32    [?]
VI   SVT    .19    [?]    .38    .51    .49
VII  MIT    [?]    .32    .44    .61    .50    .55

Range of Correlations: .19 to .64; Average: .41.

* Thurstone (1941b, Table 10, p. 29). This is the seventeen-test battery administered
to approximately 1,000 subjects in each half-year interval through ages 10 to 18. The
composite N score is the sum of the scores on the three Number tests, etc.

† Yale data (see Table 8, p. 157) based on Yale Class of 1944 as freshmen. Corresponding
data for other freshman classes show little variation from the pattern represented
above.

various courses) was bound to operate toward inter-relationship
while Thurstone’s method was directed against it. Under the 
foregoing circumstances, it seems rather surprising that the 
general level of PMA intercorrelations so nearly approximates 
that of the Yale Battery. In making such a comparison, how- 
ever, we must bear in mind that Thurstone’s subjects on the 
average were considerably younger than those at Yale. 

Because the Yale tests are but seven in number, and each is a 
composite by nature, factorial methods are not particularly ap- 
propriate to their analysis. It would be desirable to undertake 
such analyses individually throughout all subtests comprising 
the seven major parts of this battery. Wartime conditions and 
the immediacy of other problems have postponed that under- 
taking. 

Several factorial analyses (respectively derived from sec- 
ondary school and college freshman populations) have never- 
theless been made, based upon intercorrelations among total 
scores for each of the seven educational aptitude tests. Results 
of these two analyses are closely similar and reveal three dif- 
ferential factors: Verbal-Linguistic Facility and Reasoning; 
Quantitative-Mathematical or Scientific Reasoning; and Spa- 
tial-Mechanical Aptitude; plus a high residual factor which 
perhaps can best be described as “general scholastic ability.” 

These labels are no less arbitrary than others referred to in 
this chapter and earlier. Their characterization of objectives
which the Yale Battery was designed to measure is obvious.
In other words, the analyses (by Thurstone’s method) statistically
yield those three major composite (not pure) factors
which from the beginning had been sought in development of 
this battery and one of general scholastic aptitude, naturally 
well established at the higher college-preparatory and Yale 
freshman levels. 


VALIDATION OF THE PMA BATTERY 

Consideration will now be given to studies thus far reported, 
bearing upon external validity of the Primary Mental Abilities 
Battery. It should be borne in mind that, when Thurstone employs
the term “validity” in his monographs, this refers to correlation
of the separate tests with internal criteria; i.e., the
several primary factors themselves. This theoretical sort of 
validity is represented in the preceding Table 16. The data 
now to be presented deal with the more usual concept of validation:
a measure of correspondence with some external frame of
reference, such as differential performance in academic or vo- 
cational fields. 

The first investigation of this sort, reported by Shanner
(1939), dealt with results obtained from trial administration
of the PMA Battery to boys in grades 11 and 12 at a well-known
Eastern private school. Data were presented as to (a)
intercorrelation among the test composites designed to measure
the several primary abilities and (b) correlations between such
test scores and objective (Cooperative Test Service) indices of
subsequent curricular achievement. The present writer has elsewhere
discussed Shanner's article and ventured to draw conclusions
less favorable to this battery than Shanner had from
the same data. (Crawford, 1940.) Since the arguments pro and 
con (including the latter's further comment or rebuttal, after 
he had read those criticisms in proof) are set forth in the article 
just cited, they will not be reproduced here. 

In the writer's opinion, this experimental tryout of the 
Primary Mental Abilities Battery yielded no convincing evidence
of its practical validity as a differential-testing instrument.
The intercorrelations among supposedly disparate sections
were higher than one had been led to expect from advance
notices, including statements on the student's individual report 
form referred to earlier. Moreover, the seven Primary Scores 
showed little discriminative power for predicting subsequent 
grades in contrasting secondary courses. Shanner did find quite 
high relationship between Verbal Comprehension (V Factor) 
and subsequent Cooperative Achievement Test ratings in English
or history; somewhat lower but still impressive coefficients
for scores on Factor D (Deduction)¹⁴ with respect to analogous
later performance in mathematics and to a lesser degree in the
sciences. Yet these two PMA measures, especially V (like some
of those comprising the Yale Battery in operation at grade 
levels 10 and 11 at college preparatory schools) yielded disconcertingly

14. Due to its seemingly aberrant nature, scores for this factor have since
been omitted, at least temporarily, from the battery. “The deductive factor
D has been indicated in several studies, but it has not always appeared where
it might have been expected. This factor should, therefore, be regarded as
tentative and subject to reinterpretation if it can be found in clearer form in
repeated studies. In revising the experimental test battery for the primary
mental abilities we shall omit this factor because it has not been maintained
in repeated studies. Further study of the tests in which it has been indicated
may give some new interpretation for the primary factors involved, which
should be tested with specially designed tests. It seems clear now that our
first interpretation of this factor was erroneous.” (Monograph No. 2, pp. 6-7.)

similar positive correlation with all of the criterion
subjects. Hence one can hardly regard them, from this evi- 
dence, as operating differentially. Nor in this experiment do 
the other PMA tests show any better discrimination; in fact, 
most of them yield quite low coefficients throughout. 


Adkins’ Investigation of Primary Abilities among Professional 
Students 


Another validation study, employing certain external cri- 
teria as a test of this battery’s differential utility, was under- 
taken for the American Council on Education at ten universities 
in the spring of 1939. Its chief purpose was to determine the 
relationship of primary ability scores to earlier vocational 
choice, as represented by the subjects’ respective fields of pro- 
fessional study. The program essentially rested on a logical 
assumption that different (advanced) concentration areas call 
for, and emphasize in varying degrees, distinctive mental-ability
patterns.

Major findings of this investigation, reported by Adkins 
(1940), include a series of “vocational group profiles,” repre- 
senting the relative test performance of students at the post- 
graduate level (or, in a few instances, college seniors) specializ- 
ing in one of twelve professional fields. While the number of 
cases comprising some of these contrasted groups was small, 
certain interesting differences were obtained. Students of 
mathematics and chemistry made high scores in general— 
notably on factors N, V, S, I and D; next on P and least on M.
These two vocational profiles have much in common, as do those 
for specialists in physics and engineering (highest for each 
group on N, S and D). Analogous peaks may be noted for 
business administration and accounting majors in Number
(N); for advanced students of romance languages, history, 
law and journalism in Verbal (V) scores. The Medical profile
shows superiority throughout; highest for factors P, V, and S,
although none of these approaches the best level obtained by 
other students on some Primary scores. 

Adkins (1940, p. 50) states: “The fact that there are no
mean scores for this [Medical] group which tower above the 
general level is consistent with our difficulty in attempting to 
predict which abilities would be highest for this group. Ap- 
parently medicine is such a broad field that the possession of 
no one outstanding ability is necessary. Given a level of general
ability somewhat above average, a person with sufficient
interest in the field can be expected to find a branch wherein he
can achieve at least moderate success.” 

This comment as to “moderate success” perhaps reflects the 
fact that relative success of these professional students among 
their respective groups (i.e., scholastic ranking) was not considered
in this research. To cite Adkins again: “although vocational
choice admittedly is not a direct index of success, it was
felt that students pursuing a given choice to the extent of 
taking graduate work could be assumed to have exhibited a 
degree of success in that field. We therefore may expect our 
approach to reveal to a certain extent relations of the ability 
composites to success in a field as well as to mere choice of the 
field or interest in it.” (Idem, p. 45.) 

Her subsequent remarks on “Similarity Among Groups” and 
conclusions merit still further selective quotations, as follow: 

“In the course of the discussion above, we have noted certain 
similarities among profiles for some of the vocational choice 
groups. The mathematics and chemistry groups are quite sim- 
ilar; on the basis of these profiles, it would be impossible to ad- 
vise a student with a similar profile to choose one of the two 
fields rather than the other, unless additional factors¹⁵ were
taken into consideration. On the basis of the profiles alone, one 
could not presume to advise a student to choose between physics 
and engineering. A person with a profile similar to that for 
physics or engineering might be very successful in certain 
branches of mathematics or chemistry. 

“Another grouping of vocational choices on the basis of simi- 
larities in the profiles is accounting, business administration, 
and pharmacy, each being fairly high on Number and not strik- 
ingly high on any other composite. . . . 

“A third grouping of vocational choices consists of those high 
on the Verbal factor and low on Space, Memory, Induction, 
and Reasoning: romance languages, history, law, and journal- 
ism. No student could make a choice from among these voca- 
tions on the basis of his ability profile alone. The pattern of 
abilities for medicine is unlike any of the others. It is our belief 
that the choice of medicine as a vocation cannot be made on the 
basis of an ability profile, at least not until we discover some new 
abilities more prognostic of success in that field. . . . 

“A caution as to the use of these and similar results in vocational
counseling seems to be in order. The current tendency to

15. The term “factors” is probably here employed in a general rather than
technical PMA sense.

go beyond the data in utilizing test results for vocational guidance
is to be deplored. . . . We know that factors as yet unmeasurable
contribute to success in almost any field. We must
realize that one deficient in a certain ability may be able to 
compensate for this lack by substituting other abilities which 
point to success in any single field; and we know of no one vo- 
cational field for which there is one, and only one, pattern of 
abilities conducive to success.” (Adkins, 1940, pp. 52-53.) 


A Correlation Study in Professional Schools 


Some two years later, Stuit and Hudson (1942) published 
results of an investigation made at the University of Iowa along 
these same lines. This, however, specifically takes into account 
the relationship between PMA test scores and comparative 
achievement as measured by classroom grades in three profes- 
sional areas (engineering, journalism and medicine). Among 
other data these authors report the Pearson r coefficients shown 
in Table 18. 


TABLE 18

Correlations Between Factor Scores and Grade Point Averages
in Three Professional School Groups*

Groups                            Factor Symbols
               P        N        V       S       M        I       D
Engineering    .[?]50   .[?]98   .577    .178    .568     .400    .385
Journalism     .497     .[?]18   .505    .015    .[?]39   .391    .057
Medicine       .358     .179     .151    .098    -.013    -.219   .149

* Adapted from Stuit and Hudson, 1942, p. 180.


With admirable restraint, the authors of this interesting 
study characterize the foregoing correlations as somewhat “confusing
and illogical.” For example, V and M yield surprisingly
high, and S amazingly low, respective coefficients with grade 
point averages among prospective engineers. As they point out, 
however, the latter result at least is probably attributable to 
high selectivity among the engineering group of students facile 
in spatial visualizing. As repeatedly emphasized in preceding
and later chapters, correlation coefficients under circumstances 
of this nature can be markedly lowered by restriction in range 
of test results and subsequent performance alike. Despite this
influence, however, spatial visualizing scores usually show much
higher relationship with grades in engineering than is reported
above.
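
The lowering of a coefficient through restriction in range can be illustrated with the familiar textbook formula for direct selection on the test. What follows is only a sketch: the unrestricted correlation and the ratio of standard deviations are hypothetical figures chosen for illustration, not values reported by Stuit and Hudson, and the function names are likewise invented here.

```python
import math


def restricted_r(unrestricted_r, sd_ratio):
    """Correlation expected after direct selection on the test.

    sd_ratio = (SD of test scores in the selected group) /
               (SD of test scores in the unselected group).
    Assumes linear regression with constant error variance.
    """
    r, u = unrestricted_r, sd_ratio
    return r * u / math.sqrt(1.0 - r * r + r * r * u * u)


def corrected_r(observed_r, sd_ratio):
    """Inverse step: estimate the unrestricted correlation from the observed one."""
    r, k = observed_r, 1.0 / sd_ratio
    return r * k / math.sqrt(1.0 - r * r + r * r * k * k)


# Hypothetical figures, for illustration only.
full_range_r = 0.50
ratio = 0.50  # selected group half as variable as applicants at large
observed = restricted_r(full_range_r, ratio)
print(round(observed, 3))                      # about 0.277
print(round(corrected_r(observed, ratio), 3))  # recovers the assumed 0.5
```

On these assumed figures a relationship of .50 among unselected students would appear as roughly .28 within a group half as variable, which is the order of shrinkage the text has in mind.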

Except possibly for P (which, as earlier noted, Thurstone 
has found “troublesome” and probably affected by speed of 
response), Primary Ability scores throughout the battery, in 
this experimental situation, showed no significant correspond- 
ence with relative accomplishment in medical studies. V and P 
appropriately perform best with respect to grades in journal- 
ism, but the picture as a whole remains indeed “confusing and 
illogical.” 

Although Stuit and Hudson employed both the multiple correlation
technique (utilizing two or more tests in a best-weighted
combination) and profiles much like Adkins’ to characterize
their respective, contrasted groups, the results obtained
could properly justify no stronger conclusion than follows: 
“The results of this study, as well as those reported by other 
investigators, suggest that tests for primary mental abilities 
will have some value in educational and vocational counseling. 
Whether they will replace present vocational aptitude tests 
remains a problem for further research. No doubt many of 
these aptitude tests measure the same or nearly the same funda- 
mental abilities.” (Idem, p. 182.) 

The foregoing comment may be regarded as carefully sum- 
marizing data thus far obtained in attempted validation of the 
Primary Mental Abilities tests for educational prognostic meas- 
urement and guidance purposes. One other study may be men- 
tioned briefly; that by Goodman (1944) who surveyed the 
validation of Thurstone’s PMA Battery at Pennsylvania State 
College. PMA tests were reported to correlate on the whole as 
well as most standardized intelligence tests with criteria of
college success. Little evidence of high differential validity was
found. Anyone sincerely interested in student counseling must 
hope that the battery, through further trial and revision, even- 
tually will prove of practicable usefulness. Thus far, little if 
any evidence demonstrates its validity in these respects. 


FURTHER CRITICISMS OF THE PRIMARY MENTAL 
ABILITIES BATTERY 


If the writer is correctly informed, no detailed item analyses
(determining the correspondence of answers to each separate 
question with total scores on the test of which it was a part) 
were made by Thurstone in the course of early experiments. Yet 
that procedure has long been regarded as essential to the
progressive selection and scaling of test elements. It serves not
only to eliminate undesirable (nondiscriminating or ambiguous)
questions, but also to facilitate arrangement of those retained
in order of their respective difficulty. The importance of
some such method of evaluating specific items increases directly 
with the limitation on testing periods; i.e., the shorter time al- 
lowances are, the more important it becomes to validate a meas- 
ure’s individual components. Hence the seeming disregard of 
these well-known principles throughout this research is all the 
more striking, because time limits for every part of the PMA 
Battery are exceedingly brief. Its 1938 and 1941 forms con- 
sist of sixteen and seventeen subtests respectively, as described 
elsewhere in this chapter (Tables 15 and 16, pages 194 and
195). Following a short fore-practice or warm-up period, the
allowance for each separate measure is less than six minutes on
the average, for the 1941 and 1943 editions.¹⁶

Thus every section demands a race against time among items
which have not been scaled in discriminating power. Thurstone
has, to be sure, progressively revised his test materials and even
his interpretation of the primaries, as already noted. There is
no published evidence, however, that individual questions were
scrutinized or reselected by customary methods of item
analysis.¹⁷ Under these conditions, discordance among even a few
of the item responses to any instrument timed for such an
abbreviated schedule may seriously prejudice its reliability.


16. Total testing time for the 1941 battery, according to its administration
schedule, is 2 hours and 57 minutes. Over forty per cent of this total is
devoted to forepractice exercises, which do not contribute directly to the
scores. Actual test periods, of which there are 17, are allowed only 102
minutes in all. Most of the individual sections are from 4 to 6 minutes in
length. Cf. Table 15, p. 194.

17. The writer recalls a “round-table” discussion at which Professor
Thurstone explained (not to the entire satisfaction of certain critics
present) why he deemed such analysis of separate items unnecessary on the
grounds that he was primarily concerned with the operation of total scores
on various tests and their factorial significance. It is possible that
measures have since been taken, or are now under way, to remedy this
apparent shortcoming in earlier technique. Cf. Stalnaker, 1939a.


This matter of time available is one of prime importance and
should be kept in mind throughout all comparable discussion of
differential measures. Taken alone, these might be so
administered as virtually to eliminate the speed factor; but when a
whole series of tests is given within a limited total period, the
reduction of time limits for their separate parts becomes almost
inevitable. For this reason, one might say that differential
batteries (whether of the aptitude or the PMA type) have never
yet had a full chance to demonstrate their possibilities under
practical working conditions, not just experimentally.
Shortening of time allowances between the earlier experimental, and
subsequent widely available, forms of the Thurstone tests has
probably so speeded the latter as to impair their factorial
purity. Stalnaker’s article cited above supports that
impression.


THE SPEED FACTOR AGAIN 


Several references have been made to response-speed as a pos- 
sible factor and to its effect upon test scores. Thurstone’s one 
term for this postulated mental attribute, “perceptual speed,”
seems a little restricted when one considers the sheer mechanics
of expressing the results of perception on separate answer
sheets. At any rate, speed of response may account in
large degree for some of Thurstone’s latest findings and per- 
haps for the lack of discriminating power which his battery thus 
far has demonstrated in actual purposes of educational meas- 
urement and guidance. 

It seems at least conceivable that specificity of performance,
whether denoted as an aptitude or as a primary, under the
pressure of speed becomes obscured and replaced by a spurious
general-performance factor (comprising perceptual intelligence
and facility in utilizing it). By this we mean celerity of
response and of adaptation to different problems, especially in
any short-time testing situation. There seems no reason to
regard this general factor as less influential than either memory
or reasoning, so far as recorded scores are concerned. Among
several references bearing upon this point, recent studies by
Traxler (1938), Flanagan (1938) and Stalnaker (1939b)
indicate that speed of response, or what may otherwise be roughly
classified as test-taking ability, is of considerable importance.
The present writer has also suggested elsewhere: “Perhaps the
verbal factor in a series of highly speeded tests is so
continuously measured (as a reading function) throughout all their
elements that it acquires a commonality and becomes general
rather than specific.” (Crawford, 1940, p. 590.)

The first of these studies just cited indicates that speed and 
comprehension in reading, though highly related in terms of 
group comparisons, are distinguishable with respect to individ- 
ual performance. Traxler (1938, p. 55), discussing specific 
cases whose level of comprehension decidedly exceeds their 
speed, states: “The marked power of comprehending literary
materials that is possessed by these pupils would not be
discovered if it were necessary to depend upon the speed of
responses alone.” Stalnaker's further investigation of Primary
Mental Ability test scores emphasized still more strongly the 
general importance of response-speed, since he considers that 
influence as operative throughout the entire Thurstone battery, 
rather than with respect to verbal tests alone. He makes this 
interesting comment: 

“The correlation between the Thurstone primary mental 
ability score in the verbal factor (obtained by adding, accord- 
ing to his instructions, the two verbal tests) and the test of the 
College Entrance Examination Board was for this group .71. 
Thurstone estimates the correlation of his test with a test of the 
pure factor to be .68. If only one of the two tests which Thur- 
stone used for his verbal factor, the same-opposite test is con- 
sidered, its correlation with the longer test of the Board is .68. 
If the scores on only the first half of the same-opposites test 
are used, this correlation is raised slightly to .70. This latter 
method of scoring obviously gives a score less affected by the 
speed factor. In this case, cutting a test in half (using only
half the items) seems to raise its validity, or at least not to
lower it.” (Stalnaker, 1939a, p. 47.)
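
Stalnaker's observation runs counter to the usual expectation that shortening a test lowers its validity. That expectation can be sketched with the standard relation between test length and validity. The reliability entered below is purely hypothetical, since no reliability for the same-opposites test is quoted here, and the function name is an invention for this illustration.

```python
import math


def validity_after_length_change(validity, reliability, k):
    """Expected validity when a test is made k times as long (k < 1 shortens it).

    Standard length-validity relation; it assumes the items added or removed
    are comparable to the rest and that speed plays no part in the scores.
    """
    return validity * math.sqrt(k) / math.sqrt(1.0 + (k - 1.0) * reliability)


full_test_validity = 0.68   # figure quoted from Stalnaker for the whole test
assumed_reliability = 0.90  # hypothetical; not given in the source
expected_half = validity_after_length_change(
    full_test_validity, assumed_reliability, 0.5
)
print(round(expected_half, 3))  # roughly 0.65 under these assumptions
```

Under such assumptions a half-length test would be expected to fall to about .65, whereas the coefficient actually reported rose to .70. The discrepancy is consistent with the view that the unspeeded first half measures the verbal function more cleanly.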


Speeding and Commonality 


For some prognostic measures, statistically computed relia- 
bilities have been spuriously raised because their time limits 
(e.g., those in the Thurstone or Yale batteries and many sepa- 
rate tests) are considerably shorter than ideal conditions would 
afford. The references just cited, and review of other data as 
well concerning how various instruments operate, clearly
suggest that among elements of a general nature response-speed
per se is significant.

Through every battery of objective, diagnostic or prognostic 
devices with which the writer is acquainted, there seems to run 
some general aspect of what has just been referred to as test- 
taking ability. The more highly short-answer tests (whether of 
the true-false, multiple-choice or other forms) are speeded, the 
more commonality appears among their supposedly disparate 
parts. Moreover, wide differences may be observed of output
in essay examinations; i.e., in the length of answers to the same
question within an allotted time. Nor is this by any means due 
solely to corresponding ranges of information: speed of
response, sheer rate of production or even (manual) writing
facility, among others, are all variable influences affecting many
forms of examination. These indeed may account in part for
the differences found by Thurstone between his V (Verbal
Comprehension) and W (Word Fluency) factors.¹⁸
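
How a shared speed component can manufacture commonality among otherwise unrelated measures may be shown by a small simulation. It is a sketch under invented assumptions: two abilities that are truly independent, a single response-speed component entering both test scores with weight `speed_weight`, and normally distributed errors. None of these quantities is drawn from the Thurstone data.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain product-moment correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_intercorrelation(speed_weight, n_examinees=20000, seed=1):
    """Correlate two tests of independent abilities that share a speed component."""
    rng = random.Random(seed)
    test_a, test_b = [], []
    for _ in range(n_examinees):
        ability_a = rng.gauss(0, 1)  # e.g. a verbal ability
        ability_b = rng.gauss(0, 1)  # e.g. a spatial ability, independent of it
        speed = rng.gauss(0, 1)      # response speed common to both tests
        test_a.append(ability_a + speed_weight * speed + rng.gauss(0, 0.5))
        test_b.append(ability_b + speed_weight * speed + rng.gauss(0, 0.5))
    return pearson_r(test_a, test_b)


# The heavier the speeding, the higher the correlation between the two tests.
for w in (0.0, 0.5, 1.0):
    print(w, round(simulate_intercorrelation(w), 2))
```

With these settings the expected intercorrelation is w²/(1.25 + w²): nil when speed plays no part, and rising toward .44 when speed weighs as heavily as the ability itself.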

It is obvious that speed itself becomes progressively more 
important as the number of questions asked within a given time 
limit substantially increases. True, the short-answer test does 
not demand free expression and therefore possibly—through 
extensive sampling—offers a better index of what a student 
knows (not how well he writes) than does the essay type of ex- 
amination. But the former typically demands rapid response, 
which in turn involves not only quickness of mind but excellent 
eye-hand coordination—not essentially an intellective factor. 
Moreover, the current type of objective test, employing
separate answer sheets, transfers most of the clerical scoring
function from reader to pupil and particularly demands such
coordination. Making the student record his answers on these
forms (which effects great economy in time and effort required
for either hand- or machine-scoring) therefore introduces a
sort of clerical-aptitude test into every such instrument,
emphasizing still further the influence of sheer response-facility
upon all results thus obtained.


Stability of the Primaries Related to Age and Speed 


Before proceeding to a recapitulation of the criticisms ven- 
tured above and those made by certain other writers later cited, 
Thurstone’s own views as to why the PMA Battery (and even 
the isolated factors themselves) do not “stay put” in successive 
experiments demand attention. The Introduction to “Factorial 
Studies of Intelligence” states: “From the beginning of our 
work in this field we have frequently raised the question whether 
the primary factors could be isolated and appraised for younger 
subjects.” (Thurstone, 1941a, p. 1.) As suggested by the con- 
text, “younger” is a relative term contrasted with “a popula- 
tion of volunteer subjects among college students” (University 
of Chicago freshmen) utilized in the first large-scale research 
on Primary Mental Abilities. 

That the primaries have not remained entirely stable is now
clear. This is also the case with the composites represented in
Table 15. Experiments described in Thurstone’s second
monograph led to at least temporary withdrawal of Factor P, now
acknowledged to involve speed of response and therefore
renamed “perceptual speed”; of R as such, though it has somehow
been absorbed into I (“the inductive or reasoning factor I,
which has also been denoted R”) and of D, earlier questioned.
Thurstone considers the six remaining primaries V, W, S, N,
M, R (apparently in that order) as “clearly defined, some
better than others. . . . The test validities are highest for the
first three of these factors, namely V, W, and S.” (Idem, p. 7.)
P and D have proved the most troublesome to measure by tests
and therefore are currently regarded as not yet clearly
defined. W (Word Fluency) by nature proved inadaptable to
measurement through objective responses necessary for
machine-scored tests and therefore is not mentioned among the
“seven mental abilities which have been isolated to date,” as
described in the student's individual report form (1938
edition). These, as of 1946 (1941 and 1943 editions), were listed
as N, V, S, W, R, M, following the revision discussed earlier.


18. Cf.: “The word-fluency factor W is also one of the most clearly defined
primary mental abilities. It is involved whenever the subject is asked to think
of isolated words at a rapid rate.” (Thurstone, 1941a, p. 3.)


Thurstone’s second Monograph frequently stresses the point
that early age affects performance and suggests that
differential abilities evolve gradually. Other workers have been
interested in this general question of changes in mental organization
as a function of growth. Clark (1944) conducted a study using
946 students varying in age from eleven through fifteen and
grades four through twelve. She used Thurstone’s PMA
Battery (1941 edition) and reported decrease in the
intercorrelation of component scores as age increased (except for memory,
which she found little related to other factors at any age). The
important problem is whether “fanning out” of such abilities
occurs naturally in the course of time and educational growth,
or whether excessive test-speeding obscures them. Also
pertinent is the question of whether our present instruments of
appraisal are sufficiently sensitive to measure differential abilities
at earlier ages, especially with divergence in grade levels.

Had it been a secondary aim of Thurstone’s research to
construct an educational aptitude battery, he could readily have
done so by selecting from the large number of tests originally
tried out certain appropriate combinations as dictated by their
relative correspondence with external criteria. In his various
experiments Thurstone has employed a wide range of materials
in search for the inner truths, or psychological realities, of
intelligence, and earlier was frankly more concerned with
discovery along these lines than with whatever practical value his
(or any other) test battery might hold for differential
prediction. In this as in certain other respects he has since taken a
considerably broader outlook, with reference not only to test
utilities but also to independence of primary abilities, and even
the possibility that some general (e.g., perceptual-speed)
factor may overlap them all.


Review of Criticisms by Other Writers 


In the 1940 Mental Measurements Yearbook, Garrett, Kel- 
ley, Spearman, Thomson and others discuss the Primary Men- 
tal Abilities battery. While time and space do not permit a full 
consideration of their remarks, certain abstracts which may be 
of most interest to the general reader are presented here. For a 
more complete review, this excellent anthology by Buros (1941) 
should be consulted. The abstracts follow: 

1. Kelley, T. L., on importance of the factors: “He [Thurstone]
certainly should be called upon to show that they
differentiate individuals in respects that are important in academic,
vocational and avocational living if he proposes them as
essential rubrics, which he does in using the title ‘Primary Mental
Abilities.’” (Idem, p. 258.)

2. Kelley, on timing and speed: “The tests are so timed that 
no subjects are expected to finish them within the time limits 
set, thereby making speed a function of each and every one. 
This raises the reliability coefficients, which are high as re- 
ported, but it lowers the purity of the measures. It introduces 
a correlation between the tests, which the author acknowledges, 
but does not use, nor does he note that it is largely due to a 
common speed factor. The most obvious measure to be gotten 
from these seven is one of ‘mental speed,’ but no scoring for 
such is provided.” (Ibid.) 

3. Tryon, R. C., on primary abilities: “The troubling
thought is that the test, developed by such an eminent and
brilliant authority and bolstered by an awe-inspiring
mathematics, may set a new tradition of a few faculties of the mind,
just at the time psychologists are showing some signs of
recovering from the pall of the IQ doctrine.” (Idem, p. 260.)

4. Stalnaker, J. M., on speed: “The most striking
characteristic of the tests, obvious from inspection, is that they are
all short speed tests. Manual dexterity in operating a pencil
and speed of reaction are of unquestioned importance in almost
all tests. Speed is so important, and the tests are so brief, that
slight errors in timing must assume major importance.” (Idem,
p. 261.)


RECAPITULATION OF COMMENTS ON THE PRIMARY 
MENTAL ABILITIES RESEARCH 

Lest the reader gain an unduly critical impression of the
PMA Battery from immediately preceding quotations, it seems
best to summarize all preceding remarks thereon, as follows:

1. Exceptional talents have undoubtedly been focused by 
Thurstone upon complicated, theoretical aspects and
operational techniques of factorial analysis. The result is a
contribution of outstanding significance to the science of mental
measurement. Technical criticisms, however justified by reason of
undemonstrated earlier claims or extravagant hopes for this 
battery, are of minor import as compared with positive values 
attaching to this project as a whole. 

2. A great amount of statistical effort and skill has been
employed in applying these improved techniques to Thurstone’s
PMA investigation.

3. Nevertheless, according to recognized canons of test
construction, certain procedures followed in gathering his original
data and in resultant selection of the initial battery are
questionable on the following grounds:

a. Failure to make adequate item-analyses of the individual
questions; hence no demonstrable scaling of current test
elements in progressive difficulty.

b. Excessive influence of sheer response-speed, occasioned 
by the short time limits for separate tests, resulting in spurious 
reliability of original scores and their derivatives. 

c. Rather high probable errors of the basic intercorrela- 
tions among various measures contributing to the present bat- 
tery, as to its predecessors. 

4. For the above reasons, cumulative doubts arise as to prac- 
tical validity of those fundamental intercorrelations, from which 
the factor loadings and consequent segregation of primary
elements were initially derived. No elaboration of mathematical
procedure, or subsequent precision in technique, can
exterminate errors of measurement in the basic data.

5. Numerous changes have been made, during the course of
subsequent experiments, in the nature and combination of
testing materials employed as composite indices of the several
primary abilities. This situation has both favorable and
unfavorable implications with respect to progress thus far made:

a. Favorable, as indicating that the experiment is in no
sense “frozen,” and that new inquiries are being objectively
pursued with an open mind in respect to future developments.

b. Unfavorable, in so far as doubts have arisen regarding
validity not only of the measuring instruments as such (cf. 3
above) but also of the primary factors themselves. Originally
postulated as independent unities, these now appear to be
variable in nature for different age levels or populations, and
therefore not primary in a realistic sense.

6. Formal recognition has at last been given to some gen- 
eral intellective factor (possibly response speed or perceptual 
speed) akin to Spearman's g and in some degree present 
throughout all PMA tests. Since Thurstone now prefers to call 
this a central energizing factor, it may be well to quote his exact 
words: “Our conclusion regarding this old question is then 
briefly as follows: There seems to exist a large number of special 
abilities that can be identified as primary abilities by the 
factorial methods, and underlying these special abilities there 
seems to exist some central energizing factor which promotes 
the activity of all of these special abilities.” (Thurstone, 1945, 
p. 8.) This situation amplifies the preceding challenge as to 
fixed existence of independent mental factors. 

7. As judged by external criteria rather than mere self- 
contained data, no studies reported thus far indicate that the 
Primary Mental Abilities Battery serves a notably useful
purpose for individual measurement and student counseling,
whether educational or vocational. We can but hope that, with
further revisions and adequate trial against such standards 
of performance, it may in time achieve practical as well as theo- 
retical validity. Thurstone thus far seems to have evidenced 
little personal interest in such mundane affairs. 
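
Item 3c above speaks of rather high probable errors in the basic intercorrelations. The conventional large-sample formula for the probable error of a Pearson coefficient, PE = 0.6745(1 - r²)/√N, gives some notion of the magnitudes involved. The coefficient and group sizes below are hypothetical and are not taken from Thurstone's data.

```python
import math


def probable_error_of_r(r, n):
    """Conventional large-sample probable error of a Pearson r based on n cases."""
    return 0.6745 * (1.0 - r * r) / math.sqrt(n)


# Hypothetical coefficient and sample sizes, for illustration only.
for n in (100, 400):
    print(n, round(probable_error_of_r(0.40, n), 3))
```

On these assumed figures a coefficient of .40 carries a probable error of about .057 with 100 cases and about .028 with 400, so that quadrupling the group only halves the uncertainty.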


This chapter opened with a brief discussion of current
theories, namely bifactor or trifactor (Spearman's), group
factor (Thomson's) and multifactor (Thurstone’s), which have been
advanced to describe and measure the basic elements comprising
human intelligence. These several concepts must not be confused
with those underlying general intelligence tests (largely verbal
or otherwise scholastic in nature) discussed in Chapter III; nor
of course are the persons just named the only ones who have
made vital contribution to factorial conjecture and practice.
Following this cursory survey of theories, one, as represented
by Thurstone’s Primary Mental Abilities research, was selected
for detailed analysis because of the wide attention it has
attracted in psychometrics. The Spearman-Holzinger Unitary
Traits Study is no less basically important but, as earlier noted,
comparatively little has yet been published about it except in
the form of technical and complicated materials.

There is no reason why factor-analysis procedures should not 
be utilized in the analysis of achievement tests. However, they 
seem chiefly to have been associated heretofore with aptitude 
or other prognostic measures, perhaps because of the extensive 
work of Holzinger, Spearman, Thurstone, et al. Peters and 
Van Voorhis (1940, p. 276) point out that, as usually em- 
ployed, factor-analysis technique suffers from lack of adequate 
criteria. Only such factors can emerge as are represented in 
any battery. Hence it shows only what the tests have in com- 
mon, not necessarily the factorial components of each trait 
which one may be interested in measuring. On this subject, God- 
frey Thomson (1939, p. 80) offers the following comment: 
“the different systems of factors proposed by different schools 
of ‘factorists’ have each their own advantages and
disadvantages, and it is really impossible to decide between them
without first deciding why we want to make factorial analyses at
all.” 

It is perhaps conjectural whether factor-analysis methods 
will in turn be superseded by other improved techniques, but it 
is probably safe to assume that their impact on the whole psy- 
chological testing movement will be felt for many years to come. 


CHAPTER VII 


TEST CONSTRUCTION AND THE MEASURE- 
MENT OF “IDIOSYNCRASIES” 


In Chapter II, certain basic concepts of statistical analysis
were discussed by way of introducing, to readers unfamiliar
with the terminology or procedure of quantitative
methods, their applications in educational guidance. In turn,
there followed a survey of various materials intended respec- 
tively to measure general scholastic intelligence, specific 
achievement, differential aptitudes and primary or unitary 
traits. Before attempting in Part II more detailed considera- 
tion of aptitude tests for particular areas of study, it seems 
advisable to discuss additional principles underlying sound con- 
struction and administration of all objective tests, whatever 
their special purposes may be. The present chapter in a sense 
parallels the earlier one on statistics, and likewise makes no pre- 
tense at full coverage of its subject matter. This will, however, 
attempt to outline further techniques through which the vari- 
ables inherent in any process of testing human abilities can be 
at least reasonably well controlled. 

Therefore we shall next consider problems related to system- 
atic means of evaluating test items, of safeguarding them from 
unrestricted circulation which might vitiate their utility and of 
care in administrative procedures. Brief attention will also be 
paid to the “perseverance of individual idiosyncrasies.” This 
formidable-sounding term represents a concept of major im- 
portance to the interpretation of differential aptitude scores. 
It raises the question of whether relative readiness-to-learn, in 
one area as compared with others, persists throughout the edu- 
cational years or is somewhat fleeting and changeable. That 
point will be considered later in this chapter, with particular 
reference to verbal versus mathematical talents. 


RELATIVE PRECISION IN EDUCATIONAL
MEASUREMENT


Statistical methods in general, and the techniques of test
construction in particular, have a parallel relationship. The
interpretation of quantitative data obviously requires statistical
treatment. As such data become increasingly complicated, new
means of analysis develop and then new experiments (in the
educational field, new or more specialized tests) are devised,
leading again to further statistical refinements. One is
inevitably reminded of the old query: “Which came first, the chicken
or the egg?”

For the areas with which we are primarily concerned, this
situation is exemplified by a contrast between methods of test
construction and statistical analysis alike which served in the
earlier days of mere general intelligence measurement, and
those now current. Progressive developments, especially
throughout the past two decades, have contributed new
techniques to both test construction and analytical procedures.
Moreover, they have emphasized certain desirable
characteristics essential to a particular type of measurement. For example,
it has long been recognized by psychometric authorities that
the determination of group levels or differences (even in
general ability) requires less precise instruments than are needed
for reliable determination of like individual levels or
differences (Kelley, 1927, pp. 210-211). The further appraisal of
specific, variant attainments or aptitudes within the individual
himself demands still greater precision.

If we seek mental measurements which can effectively
segregate and contrast these differential abilities, relatively
throughout each person’s total range of talents, two considerations are
notably important. First, the tests employed must show a high
degree of internal consistency; i.e., in statistical parlance, with
reliabilities not below .90 at least, and in most situations of this
sort not below .95. Second, they should overlap (intercorrelate)
in minimum degree if they are to measure really separate
functions. On this point, no such dogmatic position as taken
above with respect to internal reliabilities can be assumed or
defended, because few educational indices (whether prognostic
tests or criteria of achievement) are wholly independent.
Furthermore, some of these individually differential fields are by
nature more closely related than others. Perhaps this difficult
overlapping problem may be somewhat clarified by the
subsequent discussion, in this chapter, of individual idiosyncrasies.
Meanwhile certain other technical considerations relative to the
development of testing materials will be considered.
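
Why both high reliability and low overlap matter for differential measurement can be seen from the standard formula for the reliability of a difference between two scores expressed on a common scale. The sketch below is illustrative only. The intercorrelations of .30 and .60 are hypothetical; the reliabilities of .90 and .95 are the floors named in the preceding paragraph.

```python
def difference_reliability(rel_x, rel_y, r_xy):
    """Reliability of the difference between two scores of equal variance.

    rel_x, rel_y are the reliabilities of the two tests and r_xy is their
    intercorrelation (the overlap between them).
    """
    return ((rel_x + rel_y) / 2.0 - r_xy) / (1.0 - r_xy)


# Reliabilities from the text; the intercorrelations are assumed values.
for rel in (0.90, 0.95):
    for overlap in (0.30, 0.60):
        print(rel, overlap, round(difference_reliability(rel, rel, overlap), 3))
```

On these figures two tests each of reliability .90 yield a difference score of reliability about .86 when they intercorrelate .30, but only .75 when they intercorrelate .60. Raising each reliability to .95 restores the latter figure to about .88.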


PRETESTING THE MATERIALS OF GUIDANCE


A major desideratum in constructing any examination is
some means of testing the test itself. In physical measurement
this can be readily accomplished within acceptable tolerance
limits by calibration with some highly accurate external stand- 
ard. In mental measurement no such possibility exists; mental 
measurements can be calibrated only in terms of other mental 
measurements, themselves relatively inaccurate. The test con- 
structor is therefore continually plagued, as we have already 
noted, by inadequate criteria. Yet pretesting of his materials 
(by whatever means are possible and granting their inexact- 
ness) is far preferable to merely subjective guesswork as to 
respective validities. Experience has proven in many situations 
that neither faculty members, academic or professional exami- 
nation boards nor even so-called testing experts can foretell 
by armchair speculation how the questions they faithfully 
struggle to devise will actually work in practice. Interesting 
and pertinent evidence to support this view will be found in 
Brigham’s (1932) work, A Study of Error, and other
publications earlier cited.

In recent years, considerable attention has been given to 

' various pretesting methods. These naturally are based upon 
one form or another of trial-and-error procedures, because that 
is what pretesting implies. One method is successively to try out 
various tests year by year upon relatively homogeneous groups, 
and through subsequent analysis to determine by correlation 
with appropriate criteria which instruments as a whole (also 
which particular items within each of these) prove most effec-
tive. That has generally been the method employed in the de-
velopment, viz., of Cooperative Test Service or numerous other
achievement, intelligence or aptitude measures, progressively 
revised through continuous research. 

Another sort of pretesting, less widely used because (unlike 
that just mentioned) it requires cooperation from institutions 
other than the examining body itself, consists of “spot” tryouts 
among appropriately chosen groups; e.g., eleventh-grade high 
school boys; college freshmen electing certain subjects; pre- 
medical or prospective law students; seniors majoring in one 
or another field. This procedure has been effectively utilized by 
various organizations or individuals when circumstances per- 
mitted. The Carnegie Foundation, for example, thus pretested 
its Advanced Level Graduate Record Examinations at various 
colleges. Several times as much material as could eventually be 
used was administered on a trial basis to “concentrators” in 
their respective areas. From internal evidence as to conformity 
of individual items with total score on the trial forms or with 
other criteria, it is thus possible objectively to select questions
of known difficulty and discriminating power (at specific edu-
cational levels) before an examination is finally assembled. The
two pretesting methods just mentioned can, of course, supple- 
ment each other. 


Immediate Pretesting from Internal Evidence 


Still other means for determining the validity of new mate- 
rials, before responses thereto affect individual rank, may be 
utilized where large numbers are involved. One consists of 
scoring an entire examination and immediately making an item 
analysis of all its components. Questions which appear out of 
line with total performance can thus be eliminated, and final
net scores determined for each individual after the nondiscrimi- 
nating or “reversal” items have been identified and rejected. 
The United States Civil Service Commission has employed this 
procedure for some years in objective examinations; the New 
York State Board of Law Examiners, to mention but one more 
out of many possible illustrations, thus regularly investigates 
each short-answer part of its Bar Examination. Pretesting in 
this sense is somewhat of a misnomer, because the items are not 
actually tried out in advance; yet they do come under individual 
scrutiny after initial scoring and before the candidates’ final
standing is determined. Consequently “pre” here means prior 
to announcement of ratings, rather than to holding of the exam- 
ination. 
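
The immediate item analysis just described can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure of any agency named above: items are scored 1 (right) or 0 (wrong), each item is compared with the total on the remaining items (a common refinement of "total performance"), and the cutoff `min_r` and the sample data are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Plain Pearson correlation; returns 0.0 when either list is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def item_analysis(responses):
    """Correlate each item (1 = right, 0 = wrong) with the total on all other items."""
    totals = [sum(row) for row in responses]
    out = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        rest = [t - x for t, x in zip(totals, item)]
        out.append(pearson(item, rest))
    return out

def net_scores(responses, min_r=0.0):
    """Reject items whose correlation is at or below min_r, then rescore everyone."""
    r = item_analysis(responses)
    kept = [j for j, rj in enumerate(r) if rj > min_r]
    return kept, [sum(row[j] for j in kept) for row in responses]

# Hypothetical data: five candidates, four items; the last item is a "reversal",
# answered correctly only by the two weakest candidates.
responses = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
]
kept, net = net_scores(responses)
```

In this small example the fourth item correlates negatively with performance on the rest of the test, so it is rejected and final standing rests on the first three items alone.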

The College Entrance Examination Board has developed 
one rather unusual method of true pretesting: that of including 
various experimental sections within each edition of its Scho-
lastic Aptitude Test. For illustrative purposes, let us assume
that 10,000 students are to be examined on a certain date, and 
that seven subtests jointly constitute the examination proper. 
Each of these sections has been made up from materials earlier 
tried out with like populations, so that their range of difficulty 
and discrimination is quite well established. Then an eighth 
(experimental) part may be so introduced that the student 
cannot recognize it as such. While performance thereon does 
not contribute to individual scores, motivation is maintained 
on this section (or perhaps on certain questions run in among 
others of like nature, already standardized) by so imbedding 
tryout materials within the total examination that candidates 
have no means of “spotting” them as personally inconsequen-
tial.


220 Forecasting College Achievement 


This procedure is highly effective in pretesting novel items 
or whole subtests, since evaluation of the new elements can be 
made in terms of their relative correspondence with known cri- 
teria or group characteristics. Moreover, a number of experi- 
mental parts can be tried out each year, by having (in this 
hypothetical instance) one new section represented in the first 
1,000 examinations; another in the second 1,000, etc. With a 
total population of 10,000 candidates, ten entirely new subtests 
might be employed in rotation and quite an extensive supply 
of new materials (thereafter also presenting known characteris-
tics) thus analytically selected for subsequent use.
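
The rotation described in this hypothetical instance amounts to a simple block assignment. The sketch below assumes booklets are numbered serially from zero and issued in order; the function name and parameters are illustrative only. In practice each block of 1,000 would also need to be a fair sample of the whole population for the resulting item statistics to be comparable.

```python
from collections import Counter

def experimental_section(serial, block_size=1000, n_sections=10):
    """Return the 0-based experimental section carried by booklet number `serial`."""
    return (serial // block_size) % n_sections

# With 10,000 candidates, each of the ten new subtests is tried on 1,000 of them.
counts = Counter(experimental_section(s) for s in range(10_000))
```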

Reference was earlier made to the sampling of groups tested 
and the importance of determining norms appropriate to a par- 
ticular situation. Sampling of the materials employed, through 
pretesting or trial-and-error methods, is no less essential to the 
production of a valid instrument. If circumstances do not per- 
mit adequate preliminary tryouts, then selective evolution of 
such materials can only be obtained through ex post facto re- 
search. Best results, of course, are obtained by combining both 
methods—i.e., pretesting plus follow-up analyses. The former 
is not so much an end in itself as a short-cut; it facilitates the 
inevitable follow-up process by making it possible to reject cer-
tain weak items or ineffective new “hunches” in procedure before
longer-range studies are planned. Hence pretesting, despite the 
initial cost entailed, will prove economical of time and effort 
alike in the end. Yet this cannot eliminate the need for subse- 
quent investigations continually pursued, if any test is to re- 
main dynamic and vital rather than static and frozen. 


COLLEGE ENTRANCE EXAMINATION BOARD 
TECHNIQUES


Several references have already been made to College En- 
trance Examination Board procedures, especially regarding 
the Scholastic Aptitude Test. Still others will follow as verbal,
mathematical and spatial visualizing measures in particular are 
subsequently considered. Lest the writer seem unduly to em- 
phasize the research and techniques of that organization, it may 
be proper for him to disclaim any official connection with the 
College Board (except briefly as a member of its Advisory 
Committee on the Objective Test Series). He was, in fact, ear- 
lier regarded by certain members of the Board as a severe and 
even hostile critic of its methods. This was at a time when Col- 
lege Board Examinations and their objectives alike seemed 
quite different and much narrower in outlook than they are
today. Dean McConn, speaking some years ago before a group 
of educators, remarked: 

“We all know what the Regents’ Examinations have done to
the high schools in the State of New York and what the College 
Board Examinations in their original form did to the whole 
group of Eastern preparatory schools. (Of course, it should 
be noted, in parenthesis, that the College Board situation is very 
different now, since they have developed their comprehensive 
examinations and scholastic aptitude test, and with their newly 
proposed ‘qualifying examinations’ we have, it seems to me, the 
first recorded case in natural history where the leopard has 
changed his spots).” (McConn, 1934, p. 116.)

Anyone familiar with the Board’s increasingly broadened 
and fruitful activities during the past decade will doubtless 
agree that the natural phenomenon then observed was not 
ephemeral; certainly no atavistic return to earlier ways has 
since taken place. Hence the writer, like many other school or
college officials concerned with the problem of student guidance,
now regards the College Entrance Examination Board as a far
more progressive and widely influential body than it was
throughout a considerable earlier period. While some of its 
methods are still open to criticism, this is quite different in 
nature from the old charges which are no longer appropriate. 


CEEB Contributions to Modern Testing Procedures


Emphasis in this and other chapters upon testing materials
and methods developed by the College Entrance Examination
Board simply recognizes the important contributions which
that body has been making for some time past to measurement
and guidance procedures. Some of those bearing upon the
mechanics of test construction or administration merit special
notice. The College Board is systematically conducting research
studies¹ of a fundamental nature which may be expected to pro-
duce valuable future results. As Dean McConn’s comment
quoted above suggests, the Board at one time was regarded as
exercising considerable (and too restrictive) influence within a
small and select circle alone. Now that its test scores on over
30,000 students are reported annually to more than 300 educa-
tional institutions throughout the country, and approximately
600,000 young men were screened during a two-year period
in connection with the Army and Navy College Training Pro-
grams (College Entrance Examination Board, 1943a; 1944;
1945, p. 14), it quite definitely stands forth as a national ex-
amining body. Over 30,000 students were tested in April, 1946.
College Board techniques have for some years reflected an
admirable combination of pretesting and follow-up research
(on essay-type as well as objective examinations). Certain
other bodies already mentioned have, in their several ways,
stressed corresponding experimental and research methods:
notably the Carnegie Foundation in developing its Graduate
Record Examination; the State University of Iowa in formu-
lating under Lindquist’s direction the comprehensive battery
outlined in Chapter IV; the Cooperative Test Service; the Na-
tional Board of Medical Examiners, with reference particu-
larly to its Medical Aptitude Test discussed in a later section;
the Thurstones in their Primary Mental Abilities studies and
to a lesser degree in preparation annually of the American
Council Psychological Examination; Terman and associates at
Stanford University through revisions of the well-known Stan-
ford-Binet Intelligence Quotient measures, et al. By contrast,
among the large number of tests listed in Hildreth’s Bibliog-
raphy (1939), in Buros’ Yearbook (1941), in the Psycho-
logical Corporation's Annual Catalogue (1945), or elsewhere,
many could be named which were either produced without bene-
fit of adequate (if any) pretesting, or have remained in wide-
open circulation for a considerable period without change.


1. An outstanding example of such research studies in recent years is: Noyes,
E. S., Sale, W. M., and Stalnaker, J. M. (1945) Report on the First Six
Tests in English Composition with Sample Answers from the Tests of April
and June, 1944. New York, College Entrance Examination Board.


Practice Makes Perfect 


The writer some years ago was requested for advice concern-
ing aptitude tests and rating procedures by a bright young
lady appointed as “personnel technician” to a fairly large East-
ern company. Having just been graduated from college, she
was admittedly and refreshingly innocent of experience in such
matters; it transpired that “she got the job” largely because
her perfect score on the Otis Test (Self-Administering) had
never been previously obtained by a candidate for employment
with that concern. “Well, Mr. Crawford,” she explained, “I
really ought to have those answers down cold by now; I took
that same test four times as demonstration material in different
‘psych’ and education classes at college!” Since the employ-
ment supervisor had not seen fit to inquire about her previous
exposure thereto, the applicant felt it unnecessary to volunteer
this information.

The Otis Test² has doubtless attained its wide and continued
use because of basic merits; but it hardly seems reasonable that 
any measure can enjoy unrestricted circulation for a long pe- 
riod (in this case, some twenty years), with but slight change 
in form or content, and suffer no impairment of efficiency. 
Granting the unusual nature of this particular incident, it 
serves to illustrate (even as reductio ad absurdum) a principle 
no less important for objective testing procedures than has long 
been recognized in respect to essay-type questions.³

Professors carefully guard their examination papers until 
the fateful hour. Yet many intelligence or achievement meas- 
ures are freely marketed on request and repeatedly used. The 
writer has data on file from a large and well-known secondary 
school for boys concerning successive annual administrations 
of the American Council Psychological Examination to all its 
students. A majority of the latter, by senior year, had taken 
this test several times. Although the forms employed were not 
identical, their general pattern was similar enough to produce 
considerable practice effect; this may explain in part why the 
mean and general distribution of scores thereon, for seniors at 
that school, considerably surpassed national college freshman 
norms. 

Misuse of Unprotected Test Materials 


It is even said, without positive verification, that some teach- 
ers employ unprotected tests (i.e., those readily obtainable and 
not safeguarded as are those of the College Board or the Car-
negie Foundation, for example) in advance coaching of stu-
dents who will later be judged for college admission, scholar-
ship grants and educational placement by performance on
identical or closely similar instruments. Such misuse of tests
because of their general availability is nothing less than educa-
tional treason and certainly represents a disservice to the stu-
dents consequently misguided.⁴ Crimes of that sort are indubi-
tably rare. It is much more likely that chance earlier exposure
to certain instruments (e.g., Cooperative Achievement, Otis or
ACPE measures) may later inadvertently affect individual per-
formance on the same or parallel forms.⁵ The importance of
safeguarding test items of known value and of stabilizing ad-
ministrative procedures will next be considered jointly.


2. Otis, Arthur S. (1922, 1928, 1937). Successive editions have appeared dur-
ing this long period, but the separate items have been so slightly modified
and are virtually so identical in function, as not to represent in the true sense
a revision of this popular instrument.
3. Douglas [...], of Southern Illinois Normal University, has even
urged legislation to control the distribution of educational tests and the
licensing of those permitted to administer them.
4. Cf. also the brief discussion of the Armed Forces Institute Tests of
General Educational Development, p. 128.


ADMINISTRATIVE PROTECTION OF STANDARDIZED
MEASURES


Certain methods long and consistently followed by the
College Entrance Examination Board will serve to illustrate
careful protection of materials accumulated at considerable
expense. Ever since the beginnings of its Scholastic Aptitude
Test, the following precautions have been observed: (a) ad-
vance registration of candidates, to whom a Practice Booklet
containing sample material is then issued, in order to acquaint
them (prior to the examination itself) with the form of ques-
tions it contains; (b) use of sealed booklets which remain un-
opened until the students are instructed to break the seals and
start on the actual test; (c) insistence upon accurate timing
throughout its administration; (d) specific regulations as to
recording and subsequently checking the identification numbers
of all booklets distributed; (e) absolute injunctions as to the
return, from each examining center, of every booklet (used or
unused) as soon as the examination is concluded.

The College Board’s employment of sealed booklets and its
rigid insistence upon the immediate return of them all to a cen-
tral office for scoring, may at first seem but troublesome minu-
tiae; actually, both represent important contributions to mod-
ern testing procedure. This strictly guarded method of adminis-
tration practically eliminates the possibility of students’ becom-
ing acquainted in advance with specific test materials or being
“crammed” thereon by overzealous teachers. At the same time,
care has also been taken to assure general advance orientation
of all candidates, as noted above, through the use of a Practice
Booklet illustrating the nature of questions later to be asked,
mechanics of recording these on a separate answer sheet, etc.


5. This situation [...] Aptitude
Battery. Certain material [...] ad-
ministered to entering freshmen, some of whom had taken essentially the same
test (except for intermediate revisions or higher-level forms) two or three
years previously.


Control of Directions and Timing


Still another development in examining techniques which
this body has done much to improve (even if not, as with the 
foregoing devices, to initiate) is accurate control of time allow- 
ances and uniform, oral directions governing each actual test. 
These several procedures have in recent years become recog- 
nized as highly desirable and, in spite of the complications they 
entail, are gaining increased currency among carefully adminis- 
tered programs. Both the protective (sealed booklet) and illus- 
trative (practice booklet) methods, together with emphasis 
upon accurate timing and precise instructions, have been rig- 
orously enforced, for example, by the Carnegie Foundation in 
its Graduate Record Examination earlier discussed. It may 
seem outside our main topic thus to outline details of adminis- 
tration for particular testing programs. However, these are of 
considerable importance as exemplifying certain methods by 
which errors of individual measurement occasioned by mere 
laxity in procedure can be reduced to a minimum. 


BASIC IMPORTANCE OF EXPERIMENTAL CONTROLS 


A major principle of quantitative measurement, whether in
the exact physical sciences, in economics, in psychology or else-
where, is that possible sources of error be rigorously scrutinized
and controlled so far as possible in experimental situations. The
details mentioned above consequently represent notable efforts
to anticipate and minimize procedural errors in mental meas-
urement. While analogous precautions have on occasion been
utilized elsewhere, they are seldom so carefully and consistently
maintained as they have been in the foregoing instances and a
few others.

A chief reason for lack of such controls, including proper ad- 
vance “tryout” of new material, in many testing situations is the 
expense they entail. It may be assumed that most educational 
experimenters or administrators would be eager to institute 
them, if facilities permitted. In this respect, the College Board, 
by reason of its ability to collect substantial fees, and the
Carnegie Foundation through endowment resources, occupy
fortunate positions. These have been admirably utilized both
in the furtherance of examination techniques and in the devel- 
opment of new testing materials. Hence information of basic 
value and suggestions as to promising lines of future investiga- 
tion may be obtained from perusal of their respective bulletins 
or reports herein cited. They contain a wealth of material on
the technique and philosophy alike of testing procedures. 


Indebtedness to Brigham 


In this connection particular reference should be made to a 
section of the College Entrance Examination Board Report for 
1933 entitled “Views of Associate Secretary.” In a few pages,
the late Professor Brigham touched cogently upon several issues 
underlying the various problems of individual measurement and 
guidance. While his remarks (though all of interest) are too 
extensive for quotation in full, the few excerpts which follow 
seem especially pertinent to our general topic of differential 
forecasting at the college level: 

1. With respect to superior students: “The Executive Com- 
mittee at its meeting in November 1932 asked the Associate 
Secretary to recommend ways and means by which the examina- 
tions of the Board may be given greater prognostic value, espe-
cially in the upper ranges.”⁶

2. With respect to reliability: “Many workers in the field of 
mental measurement have long been keenly and painfully con- 
scious of the unreliability of their measuring devices. . . . In 
contrast with this point of view it is, perhaps, sufficient here to 
point out that colleges usually act as if the reported grade is 
whatever it is plus or minus zero. . . . 

“The discovered unreliabilities of examinations have led to the 
effort to eliminate some of the elements of unreliability by ob- 
jectivity of scoring. This desire to rate an individual more 
fairly has led to the ‘new-type’ movement, and has given that 
movement much of its reform character. The objective testers 
have plunged in with the individual rather than the institu- 
tional point of view. Questions of content of the curriculum, or 
the effect of such examinations on the curriculum, have simply 
not been raised, or, if raised have been regarded as heretical.” 

3. With respect to objective testing generally: “The objec-
tive test movement is founded on a sound principle of a mini-
mum of error in the assigning of a grade to an individual. This 
decent notion has been expanded into an attack on every sub- 
jective method of rating, and a defense of every objective trick 
in rating, until the crusader spirit has converted good devices
for measuring into a prescription of what should be measured.
Therefore, beginning without any notions of definitions of re-
quirements, the new testers have really written requirements
which are in fact whatever their tests may happen to measure.
The movement has created new dogmas as obstinately defended
as the old, and has missed the virility of a frankly experimental
attack.” (Brigham, op. cit., pp. 8, 10, 11.)


6. Italics by the present writer who, after some earlier years of experience
with College Board emphasis upon negative selection (i.e., the mere exclusion
of presumably unqualified students), was much impressed by this initial refer-
ence to relative prognosis among definitely superior candidates.


SCORING IN DEVIATIONS FROM THE MEAN 


The suggestions which Brigham thus made in 1933 have
since led to various important changes in College Board and
other educational measurement procedures; particularly to 
wide acceptance of a scoring system based upon individual devi- 
ations from the mean of all candidates taking a given test (as il- 
lustrated in Chapter II). The impracticable attempt to grade 
students on essay-type examinations, each supposedly reflecting 
some concept of perfection, has given way to a more realistic 
method of ranking each candidate among his fellows. Absolute 
passing or honor grades are thus replaced by self-descriptive 
indices of relative performance. 
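
A minimal sketch of scoring in deviations from the mean follows. The scale constants (a reported mean of 50 and standard deviation of 10) and the use of the population standard deviation are assumptions chosen for illustration; they are not offered as the Board's own constants.

```python
from statistics import mean, pstdev

def scaled_scores(raw, scale_mean=50.0, scale_sd=10.0):
    """Express each raw score as a standard deviate from the group mean,
    then move it onto a reporting scale (illustrative constants)."""
    m = mean(raw)
    s = pstdev(raw)  # population SD of the group actually tested
    if s == 0:
        raise ValueError("all candidates earned the same raw score")
    return [scale_mean + scale_sd * (x - m) / s for x in raw]

# Hypothetical raw scores for three candidates.
scaled = scaled_scores([40, 50, 60])
```

A candidate exactly at the group mean receives the scale mean whatever the raw total happened to be, which is the sense in which such an index describes relative rather than absolute performance.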

This is far from a new concept in mental evaluation. It stems 
indeed from principles enumerated by Galton (1870, p. 26) in 
his classic work, Hereditary Genius,⁷ viz.: “I propose in this
chapter to range men according to their natural abilities, put- 
ting them into classes separated by equal degrees of merit, and 
to show the relative number of individuals included in the sev-
eral classes. . . . The method I shall employ for discovering all
this, is an application of the very curious theoretical law of 
‘deviation from an average.’ ” 

The efforts of Karl Pearson and numerous other contributors 
to mathematical and statistical procedures were required to 
transform the native ore of Galton’s genius into the modern 
tools of measurement. A notable aid to this process was offered 
more than twenty years ago by McCall (1923) in his proposal 
of a “T-scale”⁸ (actually a sigma scale) specifically expressing
individual grades or scores as true (standard) deviates from 
the mean of respective distributions. Many refinements have
later been made in these earlier methods, but they are all based
upon essentially the same principle. This has but recently been
applied, on any extensive scale, to subjectively graded examina-
tions as well as to objectively scored tests. Analogous efforts
made by certain departments of study or educational institu-
tions as a whole to redistribute most, if not all, originally re-
corded marks in conformity with some predetermined curve are
usually based upon local standards and limited population
samples. Moreover, these procedures, by “forcing” normal dis-
tributions into a statistical cold-frame, may operate quite un-
fairly through failure to recognize varying levels of ability
among different class groups. In advanced courses especially,
pre-selection (whether by individual choice or the process of
survival) often warrants an abnormal distribution massing
above the general average. Uniformity in the mean and in the
spread of subjective grades should not be artificially cultivated
regardless of true group characteristics. On the other hand,
where less advanced classes of demonstrably similar general
promise are concerned, it seems unreasonable to have wide
fluctuation in departmental grading standards.


7. Page 28 of this same work presents a rough but fundamentally correct
“normal distribution” chart. Italics above are again by the present writer, to
emphasize Galton’s earlier conception of “equal-appearing intervals.” The
passage quoted and other of his early suggestions led the way to eventual
development of percentile and analogous rank-order scales.
8. “T-Scale” represents McCall’s personal tribute to both Terman and
Thorndike for their development of the principle initiated by Galton.


SCALING TEST ITEMS 


Reference has earlier been made to methods for establishing 
the respective difficulty and discriminative power of individual 
questions, component parts of a test (i.e., a series of closely re-
lated and functionally similar items), total separate examina-
tions or entire batteries. When larger units (such as particu- 
lar aptitude or achievement tests as a whole, or some organiza- 
tion of these into a differential battery) are standardized, the 
process especially involves consideration of appropriate norms 
—i.e., the level and range characterizing abilities in general 
among various groups to be examined. As previously noted, this 
represents one important aspect of sampling among individuals 
or groups—somewhat as major and minor baseball leagues or
football conferences are intended to equalize competition within 
each of them. No less important is the problem of sampling test 
questions by some method, so that they too will be playing in 
their own league (the examination as a whole). 

Items which are so easy for the particular group of students 
examined that almost everyone answers them correctly, or so
hard that scarcely anyone does, obviously contribute little to the 
desired evaluation. By contrast, a measure composed of ques- 
tions whose difficulty varies but little throughout its extent em-
phasizes sheer response-speed rather than intellectual power.
Such a test may serve useful purposes where routine clerical or 
simple arithmetical functions, for example, are being appraised 
with reference to equally repetitive and routine tasks. However, 
it would contribute little in differential prognosis attempting 
to gauge significant variations of readiness-to-learn, whether in 
the shop or the classroom. 
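
The screening of items that are too easy or too hard for a given group can be illustrated as follows. This is a sketch only: the index of difficulty is taken here as the proportion of the group answering correctly, and the cutoffs `low` and `high` are hypothetical values that any constructor would set according to particular aims.

```python
def difficulty_indices(responses):
    """Proportion of the group answering each item correctly (1 = right, 0 = wrong)."""
    n = len(responses)
    return [sum(row[j] for row in responses) / n for j in range(len(responses[0]))]

def usable_items(responses, low=0.10, high=0.90):
    """Keep items that neither almost everyone nor scarcely anyone answers correctly."""
    return [j for j, p in enumerate(difficulty_indices(responses)) if low <= p <= high]

# Hypothetical group of ten candidates and four items:
# item 0 is passed by all, item 1 by half, item 2 by none, item 3 by three.
responses = [[1, 1 if i < 5 else 0, 0, 1 if i < 3 else 0] for i in range(10)]
indices = difficulty_indices(responses)
kept = usable_items(responses)
```

Items 0 and 2 tell nothing about differences among these ten candidates and are set aside; the same items might be quite serviceable with a weaker or stronger group, which is why the index must be established on a comparable population.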


“The Higher the Harder” 


Ideally, an objective (restricted-answer) test should repre- 
sent a ladder of increasing difficulty, with successive rungs more 
and more widely spaced as one ascends it. If each step upward
is the same, though some people will climb more rapidly and
higher than others, their endurance is then chiefly being meas-
ured, not their progressive ability to cope with increasingly
strenuous tasks. Any examination (whether essay or objective
in nature) should first be reasonably difficult for the group
tested; i.e., selected with appropriate norms in mind. Next, it
should adequately cover the presumed range of talent within
that group. Essay questions on local examinations (so-called
finals or comprehensives) are normally prepared with specific
reference to the subject matter and level in question. Although
these often vary in rigor, it is seldom possible for their individ-
ual constructors to determine the true degree of difficulty in ad-
vance; pretesting under such circumstances is usually imprac-
ticable and will therefore not be discussed here in relation to
essay examinations. It has nevertheless been carried out in
recent years for certain College Board examinations, notably
the English Essay.


However, predetermination of difficulties is both possible and,
for accuracy of measurement, essential in assembling an objec-
tive test. A measure of this sort composed of short-answer ques-
tions, however appropriate to the task in hand, gains in dis-
criminating power through internal analysis and the elimination
of items having little differential value. Even an essay question 
which is rather simple (in the sense that most students attain 
at least a passing grade thereon) may enable some, in their own 
parlance, to “really go to town" and consequently demonstrate 
marked superiority. Restricted-answer items, by nature, do 
not offer that possibility. A candidate may know far more about
the topic presented than is necessary merely in selecting the 
right answer among four or five choices; yet he has no oppor- 
tunity to enlarge upon that knowledge, beyond marking X in 
the proper spot or blacking one of several spaces on his answer-
sheet. Hence objective tests must be scaled in difficulty of suc-
cessive questions if they are to be efficient; there is no other way
of endowing them with power rather than mere speed-of-re- 
sponse values. 

Subjectively graded examinations depend, for effectiveness, 
upon the personal judgment of constructor and reader alike; 
objective tests upon some impersonal (presumably scientific) 
method of scaling each item according to difficulty. How exten- 
sive this range of difficulty should be—i.e., what proportion of 
candidates is likely to pass even the hardest or fail the easiest 
questions—is a moot point, dependent as we shall see upon par- 
ticular aims. 

The following schematic diagram (Figure XIV) illustrates 
a spread of difficulty often sought in objective measurement to 
produce a normal distribution of results. It will be noted that 
progressive rungs of the ladder are more widely spaced here in 
both the upper and lower than in the middle ranges. 

[Figure XIV. Curve Illustrating Theoretical Series of Test Items Symmetrically
Distributed According to Index of Difficulty. Axis label: Serial Order of Items.]

This hypothetical pattern suggests that almost all candidates
will correctly answer the initial questions, while but few are ex-
pected to cope successfully with the final ones. The index of
difficulty for each item is theoretically determined from previ-
ous experience with groups presumptively comparable in talent
to those now being tested. If that presumption holds, most of
these items will, as intended, fall between the 10% and 90%
range of difficulty and yield normal discrimination between 
those limits. 
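
As a rough illustration of how such an index might be tabulated, the following Python sketch computes each item's difficulty as the proportion of a pretest group answering it correctly, and lists the items falling outside the 10% to 90% band. The function names and the sample response matrix are hypothetical, introduced here only for illustration; they are not drawn from any testing program discussed in the text.

```python
from typing import List


def difficulty_index(responses: List[List[int]]) -> List[float]:
    """Proportion of candidates answering each item correctly.

    responses[c][i] is 1 if candidate c answered item i correctly, else 0.
    """
    n_candidates = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_candidates
            for i in range(n_items)]


def outside_band(indices: List[float], low: float = 0.10,
                 high: float = 0.90) -> List[int]:
    """Positions of items whose proportion correct lies outside [low, high]."""
    return [i for i, p in enumerate(indices) if p < low or p > high]


# Hypothetical pretest of 10 candidates on 4 items ordered from easy to hard:
# item 0 is passed by all, item 1 by seven, item 2 by four, item 3 by none.
pretest = [[1, int(k < 7), int(k < 4), 0] for k in range(10)]

indices = difficulty_index(pretest)
print(indices)                # [1.0, 0.7, 0.4, 0.0]
print(outside_band(indices))  # [0, 3]
```

In this invented sample the first and last items lie outside the band: the one discriminates among nobody because all pass it, the other because none do.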

The items comprising top and bottom tenths in difficulty
naturally serve quite separate purposes. Many test constructors
deliberately set the beginning items on any section at a low level,
for purposes of encouragement and orientation. Such questions
in themselves are of little value; yet they do illustrate the
answering process, or “stunt” demanded, and further serve as a
preliminary warm-up exercise for all candidates. Conversely,
the most difficult items (about 10% in a distribution like that
illustrated above) are intended to rank the few persons of
exceptional talent, within the group and at the level tested.
Therefore they are quite steeply scaled in difficulty; each in turn is
“tougher” than its predecessor to a greater degree than prevails
among the middle-range questions.

These latter items particularly should be of a power type,
extending even the best students to their respective limits of
capacity. It is consequently important that time allowances permit
superior candidates to reach the most difficult questions. There
is little point in having these at all if they cannot be attempted;
discrimination among the ablest persons depends upon their
exposure to the hardest items, which have been included for their
special benefit. It is obviously desirable that there also be a
sufficient number of such difficult questions (in some
circumstances decidedly more than 10%) to ensure reasonable
identification and measurement at the upper level.

An examination constructed in terms of increasingly difficult
steps throughout its entire range is illustrated by the next
hypothetical curve (Figure XV). Here the curve denotes positive
acceleration in rigor of questions as they proceed; it represents
an ideal test of power, in which the candidate theoretically
reaches his top limit because he cannot go any higher, not
because time is called. This is more or less the type of continuous
progression in severity of tasks assigned, at successive
intellectual levels, which Thorndike sought when developing the
CAVD scale discussed in Chapter III.

[Figure XV. Curve Illustrating Theoretical Series of Test Items
Distributed According to Steadily Increasing Steps of Difficulty.
Horizontal axis: serial order of items.]


Different Testing Aims 


The foregoing examples reflect typical aims; still others exist
in particular situations. It may be preferable for a critical pass
or fail level to occur at some point above or below the mean.
Licensing examinations, such as for admission to the Bar, to the
practice of medicine, teaching or dentistry, for example, have
chiefly a negative purpose: exclusion of those unable to meet
established minimum professional standards. Candidates not
excluded by the process are admitted without distinction as to
the degree of excellence among them. Those who just barely
clear the hurdle enjoy the same privilege to practice as those
who soar far above it. While the latter presumably should
achieve highest ultimate distinction in their respective careers,
all tickets issued thereto are of the same color and for general
use only; none is stamped summa cum laude or even “ringside.”

An examination of this type consequently may not yield a
normal grade distribution. Rather than to discriminate among
candidates of superior talent, its main purpose is but to separate
the sheep from the goats. On the other hand, measures of
whatever type employed in the selection of scholarship and
fellowship recipients should yield dependable rankings within
the group of top-flight candidates. Hence an examination for
the first purpose (negative selection, or really exclusion
of the unfit) might well be constructed so as to give a skewed
distribution of scores below the mean (Figure XVI), where fine
discrimination is most needed to determine the “cutting-off”
point for admission.

[Figure XVI. Test Intended to Identify the Bottom Quarter for
Purposes of Negative Selection. Quartile Points with the
Following Values Are Indicated: Q1 = 56; Median = 69; Q3 = 79.]

[Figure XVII. Test Intended to Identify Relative Superiority
within the Top Quarter for Purposes of Positive Selection.
Quartile Points Are Indicated with the Following Values:
Q1 = 21; Median = 31; Q3 = 44. Horizontal axis: total score on
test.]

By contrast, the second purpose, requiring as it does fine
discrimination in the upper ranges, is best served by a positively
skewed distribution of scores (Figure XVII), i.e., one which
does not attempt to distinguish among the below-average
unwanted goats, yet carefully grades the quality of fleece
produced by each among the seemingly finest sheep. This rather
crude simile is by no means intended to suggest that superior
rank may be attained through mere ovine complacency. In
either case, test items should be selected to congregate about
an off-center point, i.e., with the prospective distribution of
scores deliberately planned to sharpen distinction within either
the higher or the lower ranges, as respectively desirable.
Results obtained under those circumstances might approximate
one or the other form outlined above. There, neither median
falls near the midpoint of scale values, but at points 69 and 31
respectively.
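
The direction of skew in the two distributions can be checked from the quartile points alone. The sketch below applies Bowley's quartile coefficient of skewness, a standard measure which the present text does not itself employ and which is brought in here purely as an illustration; the function name is hypothetical, and the quartile values are those given with Figures XVI and XVII.

```python
def quartile_skewness(q1: float, median: float, q3: float) -> float:
    """Bowley's quartile coefficient: (Q3 + Q1 - 2 * Median) / (Q3 - Q1).

    Negative values mean the median sits nearer Q3 (tail toward low scores);
    positive values mean it sits nearer Q1 (tail toward high scores).
    """
    return (q3 + q1 - 2 * median) / (q3 - q1)


# Quartile points as given with the two figures.
fig_xvi = quartile_skewness(56, 69, 79)    # test for negative selection
fig_xvii = quartile_skewness(21, 31, 44)   # test for positive selection

print(round(fig_xvi, 2), round(fig_xvii, 2))  # -0.13 0.13
```

On this measure the first distribution is skewed toward the low scores and the second toward the high, as the respective purposes require. The coefficient is modest in both cases, and a different skewness measure might rank the two differently in degree though not in direction.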


FUNCTIONAL UTILITY OF EXAMINATIONS 


Many educators and test constructors have for some time
made rather a fetish of normal distribution in scholastic grades
and particularly among objective-test scores. It was pointed
out earlier, in the chapter on statistical principles, that
assumptions of reasonable normality underlie the calculation of certain
indices, such as the standard deviation, the Pearson r or the
critical ratio of observed differences. Consequently it may often
be desirable to strive for a normal distribution of test scores, so
that these may be correlated or otherwise readily compared
from one measure to another. Yet statistical desiderata or
convenience need not dominate all mental measurements, regardless
of their functional utility. The purpose for which some
particular measure is intended may (as suggested above) be better
served by skewed or otherwise nonconformist distributions than
by one of the usual bell-shaped type. In professional licensing
examinations, for example, care should be devoted to making
fine discriminations at the point of greatest significance, i.e.,
the established passing mark. That is just what
a curve of normal distribution fails to do, especially when the
pass mark is established somewhere around the median (i.e.,
between percentiles 40 and 60) of test scores. If most of them
congregate there, gradation at the critical level is relatively
coarse.

A cogent discussion of this topic by Jackson and Ferguson
further emphasizes several of the points touched on above.
These authors suggest, if the purpose of an examination is
primarily to distinguish passed from failed candidates, that a
bimodal or even U-shaped distribution rather than one
approaching normality will better indicate where the critical line
should fall. The concluding paragraph of their article is much
to the point:

“It is going to be rather difficult for statisticians if tests 
yielding such peculiar distributions of scores ever come into 
general use. Most of their theory, tests of significance, etc., are 
based on the assumption of a normal distribution of the vari- 
ables and could not be directly applied in cases such as we have 
considered above. This is the statistician’s headache, however, 
and is of minor importance. We want tests to do specific jobs, 
and, if we can succeed in constructing tests which do this satis- 
factorily, we have solved our problem. The amount of work and 
trouble this gives others need not concern us if we know what we 
require and can get it.” (Jackson and Ferguson, 1943, p. 28.) 

This same argument perhaps applies to certain College 
Entrance Examination Board procedures. The present writer 
at least feels that, in some respects, the Board authorities 
have permitted technical emphasis upon theoretical perfection 
in construction of objective aptitude and achievement measures 
at times to overpower practical (i.e. functional) objectives. 
While Jackson and Ferguson make no reference to the College 
Board or other specific examinations, their plea for differential 
utility rather than mere statistical excellence is pertinent. 

Little criticism can be made of the CEEB Scholastic Aptitude
Test (Verbal Section), further discussed in Part II. It
seems to be the best single⁹ verbal test yet constructed. Our only
question is whether it would not prove almost as effective within
less generous limits, thus saving valuable time within the total
allowance usually obtainable, for the measurement of other
differential aptitudes or achievements.

9. As noted earlier (Chapter IV, p. 122) the Graduate Record
Examination, which heretofore has employed a “stepped-up”
version of the SAT to measure the Verbal Factor in general, now
plans to break this down in terms of three more specialized
vocabularies.

Relatively more serious arguments can be brought, we believe,
against certain of the College Board's Mathematical
Aptitude and Spatial Visualizing instruments. The writer's
contention is that, for the sake of high internal consistency and
to satisfy a technical-perfectionist complex, some one type of
question has at times been selected for certain of these tests and
then wearisomely protracted with little or no change in form.
Granting that such items are punctiliously scaled in order of 
difficulty within a single examination, they measure but one 
mental process over and over, more exhaustingly than seems 
warranted. The consequent exclusion of other functionally 
significant tasks limits the range of sampling within the total 
area represented. Fifty, sixty or more successive items all simi- 
lar in nature which merely ring increasingly difficult changes on 
algebraic equations, or an equal number of problems in match- 
ing slightly different irregular blocks as time drags on, tend 
(from actual student testimony) to become less exacting than 
merely tedious. 

The College Board authorities maintain an admirable and 
vigorous distrust of unduly short subtests. Each separate sec- 
tion, they feel, should be long enough to ensure high individual 
reliability and accorded sufficient time to minimize the danger 
of overspeeding. Possibly the particular devices just mentioned, 
some of which were assembled for special purposes or partially 
to try out new materials and provide means for thorough item 
analysis, are not characteristic. Yet at least one such repetitious 
mathematical aptitude test was employed not long ago in the 
CEEB April series, and another more recently to measure 
spatial visualizing. 

It may seem ungracious for the present writer, who is genuinely
indebted to the Board for its cooperation in aptitude testing
projects at Yale, to cavil at this maximum-reliability
obsession. Yet he feels that certain of the suggestions by Jackson
and Ferguson (op. cit., 1943) as to functional, rather than
idealistic, values in test construction merit serious attention by
this and all other responsible testing bodies. An unavoidable
contest frequently develops between (a) width of coverage
(adequate sampling of each differential area with reference to
several miniature tasks rather than one) and (b) individual
depth in the appraisal of each. Neither objective should be so
emphasized as virtually to eliminate the other. Disregard of
basic statistical principles, or insufficient depth of testing with
consequent low reliabilities, can produce only meaningless
results. But excessive concentration upon statistical or perhaps
unrealistic ideals, for their own sake, may prove equally
fruitless if the major aim of differential guidance becomes lost in
details of precision. The College Board seems occasionally to
forget the forest as a whole, in its minute examination of each tree.


SCALING AND TIME LIMITS


Many tests currently employed for one purpose or another 
have not been adequately scaled for progressive difficulty of 
their items, as established by previous analyses. Others have; 
notably those developed by the Cooperative Test Service, the 
Carnegie Foundation, the College Entrance Examination 
Board, some of the state-wide testing programs mentioned 
earlier, and certain private ones. Yet the time limits accorded 
some prescaled tests are often inadequate. Furthermore, the 
very item analyses governing their composition may themselves 
earlier have been seriously affected by the same complaint— 
inadequate opportunity for all questions to receive a fair trial. 

Let us suppose, for example, the last 20 items in a
100-question test to have been selected for terminal responsibility
(distributing the best students) because on previous occasions they
had been answered correctly by a relatively small proportion of
candidates, most of whom attained high final rank. But perhaps
in such pretesting or item analysis the seeming difficulty of these
20 questions in part arose from their position in the earlier
series. Obviously this would have affected determination of their
difficulty index. Hence care should be taken to grade difficulty
in terms of “per cent right” among candidates reaching and
attempting each item. This principle may seem too obvious for
mention; yet it is at times neglected in item analysis.
Discrimination should be appraised by the relative proportion of
superior, average and inferior students (as judged by total test score
or some external criterion) answering each question correctly.
Questions placed toward the end of any speeded test, which few
students reach, are unlikely to be evaluated with assurance.
Those persons who attempt them at all are likely to comprise
the most able or at least the fastest-working group. Hence mere
position in the total series of items may considerably affect
the index of difficulty (per cent answering correctly) or
discriminating power (agreement with total scores on the same
test). More exact procedures, such as biserial r, for measuring
the discriminative power of specific items were briefly mentioned
in Chapter II and are more fully discussed in the various
statistical references there cited.
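
The effect of serial position upon the difficulty index may be made concrete by a small sketch in Python. It computes “per cent right” on a single item in two ways: upon all candidates, and upon only those who reached and attempted the item. The function name, the use of `None` to mark an unreached item, and the sample figures are all assumptions made for illustration.

```python
from typing import List, Optional


def percent_right(responses: List[List[Optional[int]]], item: int,
                  attempted_only: bool) -> float:
    """Per cent right on one item.

    Each response is 1 (right), 0 (wrong) or None (item not reached).
    With attempted_only=True the base is the candidates who reached the
    item; otherwise every candidate counts and unreached items score as wrong.
    """
    marks = [row[item] for row in responses]
    if attempted_only:
        marks = [m for m in marks if m is not None]
    if not marks:
        raise ValueError("no candidate attempted this item")
    right = sum(1 for m in marks if m == 1)
    return 100.0 * right / len(marks)


# Hypothetical speeded pretest: of 10 candidates only 4 reach the final
# item, and 3 of those 4 answer it correctly.
final_item = [[1], [1], [1], [0]] + [[None]] * 6

print(percent_right(final_item, 0, attempted_only=False))  # 30.0
print(percent_right(final_item, 0, attempted_only=True))   # 75.0
```

Counted upon the whole group the item appears hard (30 per cent right); counted upon those attempting it, the same item appears fairly easy (75 per cent right). Neither figure is wholly trustworthy, since the four who reached it are presumably the ablest or fastest candidates.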


Variable Weighting of Different Responses


Another shortcoming of objective-test procedures is the
inherent assumption that all wrong answers are equally bad.
Refinements in scoring have been developed to eliminate credits
obtainable merely by guesswork or chance. If a true-false
examination evenly divides the correct answers between those two
categories, it is obvious that any person will answer half the
total items correctly if he marks all of them either true or all
of them false. If he knows half the correct answers and so
indicates, he may obtain a total score of around 75% right by
the simple device of marking as true all those he does not know.
If subtle enough to count up the responses of which he is
reasonably sure and find a majority entered as true, he would have
a still better chance to raise his total score by checking all the
rest as false.
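
The arithmetic of the foregoing paragraph can be set out in a few lines of Python. The sketch assumes a hypothetical 100-item true-false test in which each unknown item has an even chance of being marked correctly; the function name and parameters are illustrative only.

```python
def expected_raw_score(n_items: int, n_known: int,
                       p_guess: float = 0.5) -> float:
    """Expected number right when n_known items are answered from knowledge
    and each remaining item is right with probability p_guess."""
    return n_known + (n_items - n_known) * p_guess


# Candidate who knows nothing and marks every item "true".
print(expected_raw_score(100, 0))    # 50.0

# Candidate who knows half the answers and marks all the rest "true".
print(expected_raw_score(100, 50))   # 75.0
```

These are expected values; the score actually obtained on a given paper depends on how the unknown items happen to be keyed.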

To contravene such possibilities, or the effects of random
guessing which is likely to produce at least some credits, a
formula has been devised on the “Right minus Wrong/X” principle.
Here X represents one less than the number of options; i.e.,

\[ \text{Corrected Score} = \text{Total Right Answers} - \frac{\text{Total Wrong Answers}}{N - 1} \]

where N is the number of options. Thus for a true-false test the
score after correction for guessing is

\[ \text{Rights} - \frac{\text{Wrongs}}{1}, \]

hence simply Total Right minus Total Wrong. For a multiple-choice
item offering five options the resultant would be Total Right
minus 1/4 Wrong. Such formulas are based upon a theoretically
valid, but practically unsound, assumption that chance or
guessing responses will be equally distributed among various
alternatives. While this may be true of simple true-false
questions, it seems likely to become less so as the number of
options increases, as in a four or five multiple-choice item.
There, in most instances, one or two of the choices (sometimes
known as distractors) are patently wrong or absurd. For all but
the most naive or unintelligent subjects, choice is quickly
concentrated upon fewer possibilities, usually two or three.
Hence any such arbitrary correction is unrealistic.
Moreover a matching test (wherein, for example, eight, ten or
even twenty names or events in one column are to be matched
with, or chronologically related to, other names and events in a
parallel column) offers a complicated series of permutations
which no “Right minus Wrong/X” formula can adequately solve.
These methods, especially where matching and four or five
multiple-choice items are involved, do not actually fulfill their
ostensible purpose. In fact, they fail to strike at the root of this
objective scoring problem, which may be characterized as: How far
wrong is a “wrong” answer?
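
The correction formula just described is simple enough to express directly. The following Python sketch is one way of writing it; the function name and the sample scores are hypothetical, and omitted items are assumed to count neither as right nor as wrong.

```python
def corrected_score(right: int, wrong: int, n_options: int) -> float:
    """Rights minus Wrongs divided by (number of options - 1).

    Omitted items are assumed to be excluded from both counts.
    """
    if n_options < 2:
        raise ValueError("an item needs at least two options")
    return right - wrong / (n_options - 1)


# True-false test (2 options): Total Right minus Total Wrong.
print(corrected_score(75, 25, 2))   # 50.0

# Five-option multiple choice: Total Right minus 1/4 Wrong.
print(corrected_score(60, 40, 5))   # 50.0
```

The function embodies precisely the assumption criticized above, namely that wrong responses are spread evenly over all the alternatives; it makes no allowance for options which an informed candidate would dismiss at sight.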

One should not take for granted (as this type of formula
does) that options 1, 2, 4 and 5 in a multiple-choice item to
which 3 is the correct response are equally bad. Indeed, McNamara
and Weitzman (1945, p. 112) report that placement
of choices in the several possible positions in four- and
five-choice questions has some effect on the difficulty level of the
questions. They state: “In both types of test questions the
penultimate, or next-to-the-last, position is the one having the
greatest difficulty level. It is in both cases more difficult than
any other position to a statistically significant degree.” This
comment involves what psychologists and other experimenters,
especially in psychophysical research, have long recognized as
“position error.” When fine discrimination among shades of
color, weight of like-sized objects, quality of musical tone or
pitch, etc., is called for, relative position in a closely graded
series affects response. It has consistently been found (as
standard, modern texts in experimental psychology demonstrate)
that position 3 in a series of four, and 3 or 4 in a series of five,
options where no actual difference exists are preferential.¹⁰
Hence in careful psychological experiments (whether with
mice, monkeys or men) the order of presentation follows a
regulated pattern such as ABCD; BCDA; CDAB; DABC.
Well-informed test constructors likewise distribute their correct
answers equally among the different serial positions, but the
foregoing citation suggests that errors attributable to the
placement of right answers on many tests have not been adequately
controlled.
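
The regulated pattern of presentation mentioned above is a simple cyclic rotation, and may be generated as in the following Python sketch. The function name is hypothetical; the sketch shows only the rotation itself, not any particular experimenter's procedure.

```python
from typing import List


def cyclic_orders(options: str) -> List[str]:
    """Every cyclic rotation of the option sequence, one per starting point."""
    return [options[i:] + options[:i] for i in range(len(options))]


orders = cyclic_orders("ABCD")
print(orders)  # ['ABCD', 'BCDA', 'CDAB', 'DABC']

# Across the set of rotations each option occupies each serial position once.
for position in range(4):
    assert sorted(order[position] for order in orders) == list("ABCD")
```

Because every option falls once in every position, any preference attaching to a position as such is spread equally over the options when results from the several orders are combined.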

Old-form, essay questions often yield low statistical
reliabilities because their grading is subjective and may therefore vary
among different readers. New-form, restricted-answer questions
escape that difficulty because a single predetermined response
alone is considered right. Unless all the wrong options are
patently absurd (in which case the question would be of little
value), degrees of rightness characterize some of the unaccepted
answers. Metaphorically, if 3 in the case postulated above is
pure white, 1, 2, 4 and 5 may not all be pure black; e.g., 2
perhaps has a medium and 5 but a faint gray tinge. Yet they all
receive a pure black rating, as equally worthless.

10. The subjects, of course, do not know that they are being
deluded by “no-difference” measures. Experimenters are
objectively amoral.

Theoretical formulas to penalize guessing (or systematic
response of some kind) even where knowledge is wholly lacking
take no account of these gradations. They also operate in terms
of pure black or white; moreover total scores on most objective
tests correlate so highly (.90 to .95) with “corrected” scores
(Right minus Wrong/(N − 1)) as often to make this sort of
adjustment insignificant. The crux of the matter seems to be: can
meaningful differential values be assigned to the several possible
answers presented by some multiple-choice items, as they would be to
free-answer responses on the same topic? As earlier suggested,
the fact that scoring such items is objective and reliable
(because it is done by stencil or electrical contact) obscures the
subjectivity which often affects their original construction.
This criticism applies particularly to verbal tests (wherein the
supposedly right nuances among synonyms or antonyms, for
example, may not take account of changing language patterns,
variant connotations or geographically different idioms) and
to tests of reasoning, logical inference, relevance, proof, etc.
(wherein qualitative judgments rather than quantitative data
or recall of facts are concerned).

Brigham (1932, pp. 65-162), in his detailed analysis of a
verbal (synonyms) test, made it clear that other answers than
the one predetermined as right are by no means equally wrong.
The objective-test constructor too often regards his own
decision (again we stress its possible subjectivity) as final and
dismisses as equally worthless any variations from his
judgment. Even granting that he is right on every moot point (i.e.,
where the answer is debatable rather than quantitatively or
factually unquestionable), no credit is normally given on
objective tests to any but the one accepted response. Research on
this problem has received limited attention. Conrad (1944)
reports that three studies¹¹ have shown no evidence of more than
negligible improvement (in reliability or validity) when
differential response weighting was used instead of the simple
summation of raw scores.

11. The studies reviewed by Conrad were: Casanova (1942);
Guilford, Lovell and Williams (1942); Phillips (1943).

Despite the negative findings which Conrad reports, it is our
feeling that further research along the lines suggested by
Brigham might indicate some value for weighted scoring of certain
verbal and qualitative judgment tests, as possibly superior to
the current practice of regarding one answer as spotlessly white
and all others as coal-black. Accepted procedure in item
analysis of objective tests is still to discard nondiscriminating or
reversal questions (so classified according to proportionate
responses throughout, compared with the individual's total-score
rank). This procedure again is based upon the assumption that 
each right answer should receive uniform credit and that all 
others should be consigned to limbo without differentiation 
among them. A more qualitatively refined system of scoring 
should provide even sharper discrimination, over a wider range, 
than has yet been obtained in measurements of this sort. Numer- 
ous complications would ensue, which may account for its not 
having been introduced thus far in achievement or aptitude 
testing. Yet differential scores have long been employed,
because of their admittedly subjective nature, with interest,
personality, attitude and preference ratings. The essential tech- 
niques are consequently available for transfer to other measure- 
ment situations wherein they seem no less appropriate. 
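
What a differentially weighted score might look like can be sketched briefly in Python. The credits assigned below (full credit for option 3, half for option 2, a quarter for option 5) are wholly arbitrary and hypothetical; they merely echo the white, gray and black metaphor used earlier and are not taken from Brigham, Conrad or any scoring scheme in actual use.

```python
from typing import Dict, List, Optional


def weighted_score(answers: List[Optional[int]],
                   weights: List[Dict[int, float]]) -> float:
    """Sum the credit attached to the option chosen on each item.

    answers[i] is the option chosen on item i, or None if omitted;
    weights[i] maps each option of item i to its credit.
    """
    total = 0.0
    for choice, credit in zip(answers, weights):
        if choice is not None:
            total += credit.get(choice, 0.0)
    return total


# Hypothetical credits for a five-option item whose keyed answer is 3.
graded = {1: 0.0, 2: 0.5, 3: 1.0, 4: 0.0, 5: 0.25}
all_or_none = {1: 0.0, 2: 0.0, 3: 1.0, 4: 0.0, 5: 0.0}

# One candidate's choices on four such items (the last omitted).
answers = [3, 2, 5, None]
print(weighted_score(answers, [graded] * 4))       # 1.75
print(weighted_score(answers, [all_or_none] * 4))  # 1.0
```

Under conventional scoring this candidate earns credit for one item only; under the graded key his two near misses raise the total. Whether such refinement improves reliability or validity is, as the studies cited by Conrad indicate, an open question.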

This discussion is somewhat theoretical; it is introduced here
not to illustrate previous or even current procedures in test
construction but to point out what the writer suspects is a serious
lack therein. Its chief purpose, moreover, is to challenge that
aspect of multiple-choice testing which accepts one alternative
response as 100 per cent right and all others as 100 per cent
wrong. Unlike the essay-type question, an objective test makes
impossible the determination of thought processes leading to
the student’s choice, and grades him implacably on the basis of
each choice without information as to how and why he made it.
No degree of statistical reliability or accuracy in scoring can
overcome that limitation on restricted- as compared with
free-answer questions, but its effect might be lessened (especially
for powerful items of the judgment type) through use of
variably weighted scores.

Speculation on these somewhat technical details may seem
out of place in a general survey of measurement procedures.
The points mentioned are intended chiefly to stress again the
importance of adequate controls and unremitting research in
sound test construction. Practical obstacles often complicate
these efforts; for example, the doubts just raised as to
authenticity of item-scaling apply with particular force to the Yale
Educational Aptitude Battery. Its original time limits of
forty-five minutes per test were adopted not by choice but perforce,
so that each could be administered within a normal classroom
period and all within the total time obtainable. This stricture
impeded test revision because too few students had sufficient
opportunity for answering the later, and ostensibly most
discriminative, items.

Continuous attempts have been made to arrange questions 
in this battery from easier to progressively harder materials, 
but certain studies have shown that the results desired were by 
no means consistently obtained. Numerous experiments have 
encountered the same basic deterrent: pressure on time allow- 
ance is unremitting. Schools and colleges alike prefer short 
tests and expect a miracle to be performed within periods nor- 
mally devoted to the irregularities of a few French verbs or to 
some no less irregular incident of the French Court under one 
or another Louis. Hence, the opportunity for administering 
aptitude measures long enough to be individually reliable 
through power rather than speed factors, and for really search- 
ing analysis of their component items, is difficult to obtain under 
existing conditions. Cooperative Tests and many others which 
are given within the school year suffer like restrictions; they 
must conform to schedules maintained by the ringing of bells. 
Until measurement procedures are freed of restriction by
time-clock-punching demands, they will continue to be under a
distinct handicap in construction and performance alike.


THE PROBLEM OF "IDIOSYNCRASIES" 


The present chapter has thus far dealt with certain techni- 
calities relating to the construction and administration of test- 
ing instruments. Methods were discussed for evaluating the 
effectiveness and comparative difficulty of such instruments 
prior to their use in actual guidance; for protecting materials 
of known worth from the sabotage which free and easy circu- 
lation invites; and for advance orientation of each candidate 
to the testing materials in general through use of a practice or
sample booklet. Also mentioned as desirable were: the
prevention of too specific foreknowledge and even coaching on freely
available instruments; uniformity in timing, administrative
procedure and directions; pretesting and careful scaling of test
items as to their difficulty.

These principles are important for every phase of mental
measurement, and particularly to that bearing upon individual
differences or idiosyncrasies. Hence they merit considerably
more attention than it has been possible to include within the
foregoing cursory remarks. Another basic concept or problem,
especially pertinent in discussing relative educational aptitudes
throughout the chapters of Part II, is: how stable are the
seeming variations of personal readiness-to-learn for this or that
major field? Are they persistent or ephemeral, fundamental or
mere whims of chance, subject to change without notice?

These are difficult questions to answer with assurance, based
upon established data. A biologist may observe several
experimentally imposed mutations occurring among successive
generations of fruit flies within one summer; a psychologist can
experiment with white rats bred, fed and later given a “college
education” through conditioning under rigorously standardized
controls. Research on the human animal, however, presents far
more complex and longer-range problems. His life span is no
less than his investigator's; the subject indeed will probably
outlive the experimenter! Though new experimenters grow up
to observe new subjects, all are then different beings,
themselves (quite unlike the laboratory rats) variously conditioned
by progressive changes in their environment: wars, booms,
panics, machines, vitamins, mores, educational fads, the church
and the radio, innumerable factors complicating if not
transcending efforts at scientific analysis. That is one reason why
such environmental variations as possibly can be brought under
reasonable control in the examination process should be,
through the application of methods earlier discussed.

The comments just made in connection with the measurement
of educational aptitudes and idiosyncrasies may seem rather
pessimistic. They are, however, intended merely to suggest that
existence of such differences within many individuals (to
varying degrees) is easier to demonstrate at a given time than it is
consistently over a protracted period. Certain distinctive
features may appear, so to speak, in a “snapshot”; will they
reappear as markedly in another, taken a year or several years
later? To carry our analogy farther, the biologist can obtain a
“movie” of his fruit-fly mutations; while the educational
psychologist or test constructor at best can only hope for a few
“stills” of each person. All too frequently he may take but one
snap, with his subject out of sight thereafter. The practical
difficulties of follow-up research, under comparable conditions
and upon the same or closely analogous subjects, account for
the paucity of data on “persistence of individual
idiosyncrasies.” Some evidence relative thereto will however be presented.


Kelley on Idiosyncrasies 


In his classic volume, Interpretation of Educational
Measurements, Truman Kelley (1927) devotes a chapter to this
topic and states his belief that “many idiosyncrasies have their
roots in original nature.” Two quotations therefrom will serve
to illustrate his views:

“The term ‘idiosyncrasy’ as used in this chapter, refers to
differences in two abilities of a child as judged by comparison
with the age or grade group in which he is located. If a child
shows 10-year ability in reading and 12-year ability in
computation, he is here considered to possess an idiosyncrasy. There
is no moral obliquity attached to this peculiarity. This
observation is not uncalled for, in view of the endeavor of teachers and
others to eliminate oddity practically wherever and whenever
found.” (Op. cit., p. 98.)

“Idiosyncrasy in a person, characterized, as it must be, by 
something that is superior as well as by something that is in- 
ferior, is like an unpolished gem in a crown of rough stones. It 
may become tarnished and lost from view so that the crown is 
always considered common, but if it is cut and polished to the 
degree that it alone of all the stones permits, it will then lend 
a dignity to the crown not noticeably excelled by one all of 
whose stones are brilliant.” (Idem, p. 100.) 

This somewhat flamboyant picture of one or more coruscating
jewels in each crown is vivid; the injunction to facet and polish
each talent, moreover, implies confidence in stability of
differential aptitudes or (to carry on with Kelley’s metaphor) in
progressive enhancement of their radiance, with time and due
educational care. His chapter just noted contains further
interesting speculations as to the origin of personal educative
idiosyncrasies and definitely characterizes them as
“perseverant.” Factual data on this important point are not
offered by Kelley; as earlier stated, they are difficult to
obtain and consequently rare from any source.

Numerous studies of the snapshot type have been made,
evaluating relative success in current identification of such
differences, both severally among and individually within
students tested at a particular time. These are important for
immediate guidance purposes and probably (though not assuredly)
valid in the long run as well. Reference to these data will
again be made, with unavoidable duplication of comments or
citations, in later chapters of Part II dealing with
specific-aptitude measures; i.e., present emphasis upon
idiosyncrasies in the abstract sense will be followed in due
course by consideration of their more concrete appraisal for
diverse fields of study. As stated above, considerably more
evidence can be presented as to the existence of idiosyncrasies
at any moment than has been collected thus far regarding their
perseveration. Much of the discussion concerning differential
measures deals with observable current variations of talent
within the individual; yet the significance of these for
prognosis naturally depends upon their stability over a
considerable period. The writer’s own view is essentially like
Kelley’s cited above; i.e., that idiosyncrasies are real and
should be cultivated rather than suppressed, and that individual
differences are more important than educational conformity.
However, these opinions are speculative and admittedly difficult
to substantiate.


College Board Data on Individual Idiosyncrasy


Evidence as to differential aspects of performance on the
College Board’s verbal and mathematical measures, respectively,
is provided in Brolyer’s (1931-1935) notable work regarding the
extent and stability of these idiosyncrasies. He uses the term,
much as Kelley did earlier, to denote variations within any
person, though with particular respect to verbal and
mathematical aptitudes as measured by separate parts of the
Scholastic Aptitude Test. “By persistence of individual
idiosyncrasy is meant the extent to which an individual
maintains at a second time that difference between two test
scores which he made at the first time.” (Op. cit., 1931,
p. 193.)

Idiosyncrasy was calculated in terms of personal differences
between verbal and mathematical standard (sigma) scores. A
decided tendency was found for most individuals to maintain,
rather consistently, their relative position on one or the other
of these two test sections. For example, reported correlations
between idiosyncrasy measures, obtained from two separate test
forms over a year’s interval, ranged between .75 and .88 for
different groups (Idem, p. 195). It is evident that such
instruments offer high promise of utility for differential
measurement along these contrasted lines. Findings reported by
the College Entrance Examination Board indicate that the factor
segregated by certain (mathematical) sections of the original
SAT, or by the later Mathematical Aptitude Test, differs
significantly from that measured by verbal materials, despite a
moderate positive relationship between them. That relationship
varied somewhat for different groups and years, but the
coefficients reported are circa .35 or less.
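
The calculation described above can be sketched in a few lines. This is only an illustration of the general idea (difference between verbal and mathematical standard scores, correlated across two testings); the function names and all scores below are hypothetical and are not taken from Brolyer's data or procedure.

```python
from statistics import mean, pstdev


def standard_scores(raw):
    """Convert raw scores to standard (sigma) scores within the group."""
    m, s = mean(raw), pstdev(raw)
    return [(x - m) / s for x in raw]


def idiosyncrasy(verbal, mathematical):
    """Per-person difference between verbal and mathematical standard scores."""
    zv = standard_scores(verbal)
    zm = standard_scores(mathematical)
    return [v - m for v, m in zip(zv, zm)]


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical raw scores for six students, tested a year apart.
verbal_1 = [520, 610, 480, 700, 560, 450]
math_1 = [600, 540, 470, 650, 500, 560]
verbal_2 = [530, 620, 470, 690, 580, 460]
math_2 = [580, 560, 450, 660, 520, 540]

d1 = idiosyncrasy(verbal_1, math_1)  # first testing
d2 = idiosyncrasy(verbal_2, math_2)  # second testing

# "Persistence of individual idiosyncrasy": correlation of the two
# sets of difference scores.
persistence = pearson_r(d1, d2)
print(round(persistence, 2))
```

A coefficient near 1 would mean that each student keeps about the same verbal-minus-mathematical difference from one year to the next; a coefficient near 0 would mean the "snapshot" differences do not persist.
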


Variation in Stability of These Contrasted Measures


Two forms of the CEEB verbal test taken by the same 
individuals a year apart (in 1930 and 1931) yielded correla- 
tions of .90 to .96, while analogous coefficients for the mathe- 
matical score ranged only from .70 to .74. Split-half reliabili- 
ties for both the verbal and mathematical tests were above .95 
(Idem, p. 197). It will be noted that (despite satisfactory in- 
ternal consistencies for both measures) progressive stability 
from one year to the next is decidedly less for the mathematical 
than for the verbal index. 
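
A rough consistency check can connect these stability coefficients with the idiosyncrasy correlations of .75 to .88 cited earlier. The identity below is the ordinary correlation between two difference scores, d = V - M, formed from unit-variance standard scores at each testing. It is not a formula given in the source. The numerical line rests on assumptions: mid-range stabilities of .93 (verbal) and .72 (mathematical), a same-occasion verbal-mathematical correlation of .35, and cross-occasion verbal-mathematical correlations also taken as .35, since these are not reported.

```latex
\[
r_{d_1 d_2}
  = \frac{r_{V_1 V_2} + r_{M_1 M_2} - r_{V_1 M_2} - r_{M_1 V_2}}
         {2\,(1 - r_{VM})}
\]
\[
r_{d_1 d_2} \approx \frac{.93 + .72 - 2(.35)}{2\,(1 - .35)}
            = \frac{.95}{1.30} \approx .73
\]
```

Under these assumptions the expected persistence of the difference score falls near the lower end of the reported range. Cross-occasion correlations somewhat below .35 would raise the figure.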

Further evidence of variant perseverance in aptitude test
scores is presented in Table 19. The data represent students
tested in junior year at their respective secondary schools and
again, with the same or closely analogous aptitude measures,
in their freshman year at Yale. Considering the time elapsed,
ages of the subjects and environmental differences, the degree
of test stability found seems reasonably high. However, a
marked contrast is evident between such follow-up correlations
of .82 for verbal comprehension and .56 for mathematical
ingenuity. Tests dealing with other aptitudes seem to occupy an
intermediate position with quite consistent r values, .63 to .65.
While these coefficients (which, in effect, measure retest
reliabilities) are not high, they are but slightly below parallel
correspondence among measures of scholastic-grade reliabilities
(cf. Chapter IV, p. 133). Moreover, they have all been somewhat
attenuated by the process of selection operating between
preparatory-school and college matriculant groups.


TABLE 19


Summary of Stability Correlations Among Designated
Aptitude Measures


Tests Taken in Secondary   Tests Taken as            Number    Correlation Between
School, Fall of 1937       Freshmen in 1939          of Cases  Respective Testings

Verbal Comprehension       SAT (Verbal)                144          .82
Artificial Language        Artificial Language         135          .65
Verbal Reasoning           Verbal Reasoning            143          .65
Quantitative Reasoning     Quantitative Reasoning      108          .63
Mathematical Ingenuity     MAT (College Board)         108          .56
Spatial Visualizing        Spatial (College Board)     144          .64


One probable reason for the MAT’s lower stability over a
period of time is that, while students are always required in
one way or another to exercise the verbal factor, they vary to
a considerable degree in the extent to which their corresponding
mathematical or scientific talents are cultivated. These may
soon become atrophied or at least rusty through comparative
disuse (as many educators, for example, who once knew their
calculus have found to considerable sorrow in trying to
understand present-day vector analysis or to pursue Professor
Thurstone through n-dimensional space). In psychological terms
this merely represents the classical law of forgetting, in the
absence of steady reinforcement.

Except for persons in a field of study or vocation which
continuously demands quantitative thinking, once-acquired mental
capacity of this type is likely to deteriorate. An individual thus
may soon forget even the meaning of quantitative symbols and
the processes of manipulating them; still more so the sheer
mechanics or theorems he once knew. Except for analogous loss
of a disused foreign language, no corresponding possibility
exists of neglecting his verbal facility; through reading, study,
teaching or writing in any field (not to mention faculty
meetings, student “bull sessions” and everyday conversation),
word fluency plays an ever-active part.

These generalizations perhaps apply less directly to high
school and college freshman situations and the comparatively
short space of time they embrace. Yet if valid for a longer span,
the foregoing data may suggest by analogy why “persistence
of individual idiosyncrasies” (as between verbal and
mathematical promise, for example) is somewhat obscured by the
latter’s relative instability. That indeed may simply reflect
greater variances in the exercise of mathematical-scientific,
than of verbal-linguistic, talents within typical (prewar)
studies of the academic type.

The writer’s hypothesis, as yet quite untested, is that more
consistent performance on verbal, as compared with mathematical
or certain other aptitude tests, is primarily attributable
not to shortcomings within the latter type of measure, but
rather to variant emphases placed in secondary school and
college upon other than verbal talents (which always have their
“daily-dozen exercise”). It clearly seems more questionable in
the case of mathematical than of verbal materials to assume
an equalization of previous opportunities for practice and
study among the various individuals examined. Tests can,
however, be designed to circumvent that particular difficulty, by
constructing aptitude items in such a way that variations in
learning or forgetting mathematics (even within the
comparatively short span of two or three years between high
school and freshman classes) are at least minimized.


SUMMARY 


The present chapter, reviewing fundamental principles of
test construction, measurement techniques and the problem of
individual idiosyncrasies, brings to a close Part I of this study.
Therein we have attempted to provide a general background
for consideration in Part II of basic educational aptitude
measures and those of more advanced and complex nature. Relative
promise for advanced professional studies, and tests of a more
subjective nature (i.e., of interests, preferences and
personality traits) will subsequently be discussed in Part III.

Attention to the principles and techniques mentioned above
is vital with respect to all forms of human measurement or their
applications in personal guidance. It must be recognized that
upon them true effectiveness of prognostic instruments,
whatever their nature, will at any time rest. Preliminary efforts
in the appraisal of general intelligence led to others, more
differential in aim, based upon the objective testing of specific
achievements and subsequently of broader aptitudes. Meanwhile
research had been pressed toward the identification of
even more unitary or primary mental traits and the analysis of
new data from widely extended sources. Various problems
relating to the measurement, description and useful direction of
human abilities have been growing quite actively throughout
more than the past decade. Happily, a process of
cross-fertilization among these several efforts has produced
valuable and, to some degree, interlocking results.

Certain functional areas of differential scholastic aptitude
seem to have been more or less well defined: e.g., verbal,
linguistic, mathematical, scientific (quantitative reasoning),
spatial and mechanical ingenuity. Areas of concentration at the
higher educational levels (college majors) usually involve
several of these aptitudes, and earlier “distribution
requirements” may to some degree call upon most of them, except
for spatial and mechanical. The latter have a unique importance
in respect to engineering (a complex field, as we shall later
see), architectural or certain other technological studies which
require facility in three-dimensional visualization, and
sometimes a ready grasp of mechanical movements as well.


Practically, it would seem, such groupings or combinations of
educational aptitude indices can better be determined in
relation to established organization of our school and college
curricula, which reflect our culture, than by the more theoretical
investigation of Primary Mental Abilities or Unitary Traits.
These may, in a pure psychological sense, eventually become
most important; their useful applicability to everyday
problems of student guidance has yet to be demonstrated.


APPENDICES 


APPENDIX A 


PRACTICE BOOKLET FOR YALE EDUCATIONAL 
APTITUDE BATTERY 


The following material has been abstracted from the practice booklet 
and is presented for illustrative purposes. Though not reproduced 
here in full, the booklet contains detailed instructions for marking a 
separate answer sheet, and a larger number of examples. 


TEST I—VERBAL COMPREHENSION 


Part I: Paragraph Reading. 

DIRECTIONS: Each sentence or paragraph below contains one word
which spoils its meaning. This incorrect word is one of the five words
which have numbers printed just above them. You are to find the
incorrect word and strike out its number.

                                                            Answers

Example: “The sale of fur in the tropics is an important
business, made so by the lack of need for clothing which
will ward off the cold.”                                   1 2 3 4 5

The meaning is spoiled by the word “important,” which
is number 2 of the five words. Therefore, number 2 is the
correct answer, as indicated.


Now, answer these problems in the same way.

1. The rapid increase of natural knowledge, which is the
chief characteristic of our age, is brought about in
various ways. The main army of science moves to the
conquest slowly, never ceding an inch of the territory lost.
                                                           1 2 3 4 5

2. China, and much later, Western Europe and the United
States, invented systems of examinations which
admitted the unsuccessful candidates to one or another
kind of preferment.                                        1 2 3 4 5


Part II: Word Relations.
DIRECTIONS: In this test, the symbol # will be used to indicate
opposite in meaning to. For example:

Find NOUN # Verb “grieve”: 1-help, 2-joy, 3-sense, 4-charity,
5-image
This means that you are to find a noun among the five words given
which conveys a meaning opposite to that conveyed by the verb
“grieve.” The answer is “joy,” number 2.                   1 2 3 4 5


Work these problems:


1. Find NOUN # Verb            1-punishment, 2-open,
   “chastise”                  3-praise, 4-movement,
                               5-curse                     1 2 3 4 5
2. Find ADVERB #               1-falsely, 2-costly,
   Adjective “sufficient”      3-carefully, 4-illegally,
                               5-inadequately              1 2 3 4 5


Part III: Synonyms. 


DIRECTIONS: In each line below, the word in CAPITAL letters
is followed by two words in smaller letters. Sometimes
only the first of these two words means the same, or nearly
the same, as the capitalized word, and in this case the
answer is number 1. If the second word only is a synonym,
number 2 should be indicated. Likewise, if neither word
means the same as the capitalized word, the answer is
number 3; and if both are synonyms of the capitalized
word, the choice is number 4. The first two are correctly
answered.

1. ABSURD          ridiculous    tedious       1 2 3 4
2. GORGEOUS        intricate     eccentric     1 2 3 4
3. IMMUTABILITY    probability   inability     1 2 3 4
4. GARRULOUS       loquacious    ornate        1 2 3 4


TEST II—ARTIFICIAL LANGUAGE


This test is an attempt to measure the ability underlying facility in
learning new languages.

DIRECTIONS: Study the vocabulary and rules of the artificial
language given below. Then work the practice sentences in the order in
which they appear.


VOCABULARY:  I—vlu               to be—jahviz      good—zeyt
             he, it (nom.)—wes   to read—skraliz   book—stetsleit
                                 to have—dromiz    word—gleit

RULES: 1. Articles are not used in the artificial language.
       2. Verbs are not conjugated for person and number,
          e.g., jahviz is used for am, are, is.
       3. Future: prefix bli to the verb,
          e.g., to read—skraliz; will read—bliskraliz.
       4. Word order: as in English where possible.


SAMPLE EXERCISE:    A  B     C                               Answers
                    I have a book

A. (1) Wes  (2) Роми  (3) Vlu  (4) Polwes  (5) Vlul          1 2 3 4 5
B. (1) dromiz  (2) jahviz  (3) amdiz  (4) somiz  (5) binotiz 1 2 3 4 5
C. (1) gleit  (2) zepoldeit  (3) zeyt  (4) stetsleit  (5) oveit
                                                             1 2 3 4 5


The correct translation of the sample sentence is: Vlu dromiz
stetsleit. The correct choice for “A” is then (3); for “B” it is (1); for
“C” it is (4).


PRACTICE SENTENCES:


    A   B      C
1. Vlu jahviz zeyt.

A. (1) He  (2) I  (3) We  (4) It  (5) to read                1 2 3 4 5
B. (1) read  (2) have  (3) will be  (4) will have  (5) am    1 2 3 4 5
C. (1) book  (2) word  (3) he  (4) good  (5) it              1 2 3 4 5

        A      B        C
2. The book will have a word.

A. (1) Gleit  (2) Vlu  (3) Stetsleit  (4) Wes  (5) Zeht      1 2 3 4 5
B. (1) Blidromiz  (2) jahviz  (3) dromiz  (4) blijahviz  (5) skraliz
                                                             1 2 3 4 5
C. (1) zeht  (2) gleit  (3) wes  (4) stetsleit  (5) vlu      1 2 3 4 5


TEST III—VERBAL REASONING


This is a test of ability to think logically, i. e., to arrive at valid con- 
clusions and work out relationships from given data. 


Part I: Logical Inferences. 

DIRECTIONS: Each of the following questions has two parts—a state- 
ment of fact which is always assumed to be true, and a conclusion 
from that statement of fact. Five possible judgments can be made for 
each conclusion. They are as follows: 


1. Necessarily true
2. Necessarily false
3. Probably true
4. Probably false
5. Undetermined.


You are to examine each question carefully, make your judgment of 
the conclusion and then record the number of your judgment in the 
regular manner. Judgment number 5 (undetermined) is to be used 
when you believe that the statement of fact does not give you suffi- 
cient information to enable you to judge the conclusion at all. In each 
case, assume that the statement of fact is true. 
                                                            Answers

EXAMPLE: Most thunderstorms are accompanied by
lightning, rain and a high wind.

Conclusion: There will be a high wind with our
next thunderstorm.                                         1 2 3 4 5


Work the problems below: 


1. John is older than Jim. Jim is older than Bob. Bill is
older than Bob.

Conclusion: Bill is older than John.                       1 2 3 4 5


2. It is known that only five persons have had this book
and it is improbable that four of them would have
written these new marginal notes in it.

Conclusion: James, the fifth person, wrote these notes.    1 2 3 4 5


Part II: Interpretation of Experiments.

Although this test deals with scientific observations, it assumes no
previous study of the subject-matter involved. Each question itself
provides all information necessary for selection of the correct
answers. It is therefore a test not of knowledge but of ability to reason
logically from the facts presented and of judgment in forming
conclusions. Note that the “evidence” presented is hypothetical and
may be contrary to fact.


DIRECTIONS: In each of the following exercises, certain data are pre- 
sented. Below the descriptions of the data are several statements 
which have been suggested as possible interpretations. Assume that 
the facts given in the description and in the results are correct. Then 
on the basis of these facts only, consider each statement. One of five 
possible judgments can be made for each interpretation: 


1. The evidence is sufficient to make the statement necessarily true.

2. The evidence is sufficient to make the statement necessarily false.

3. The evidence suggests that the statement is probably true.

4. The evidence suggests that the statement is probably false.

5. There is insufficient evidence to make a decision concerning the
statement.


EXAMPLE: In studying the habitats of red maple trees, they were
found growing only in swamps, along rivers and in bogs. In studying
the habitats of American elm trees, they were found growing only in
swamps, along rivers and in bogs. In all of these different habitats,
the leaves of the maples were always opposite on the branches and
the leaves of the elms were always alternate on the branches.

                                                            Answers
INTERPRETATIONS:
a. The habitats in which the two kinds of trees grew did
   not affect the position of leaves on the branches.      1 2 3 4 5
b. A certain amount of water was necessary for both kinds
   of trees to grow.                                       1 2 3 4 5
c. American elms were affected more by the environment
   than were red maples.                                   1 2 3 4 5
d. The leaves were always opposite on the branches of
   American elm trees.                                     1 2 3 4 5
e. Cedar trees are also found growing in swamps, along
   rivers and in bogs.                                     1 2 3 4 5


Part III: Word Analogies.

DIRECTIONS: In each of the following items, notice the relation
between the first two words, then cross out the numbers of the TWO
WORDS having most nearly the same relation as the two given words.
Remember to cross out the numbers corresponding to TWO WORDS,
as no partial credit is given for one number correctly marked.


                                                            Answers
EXAMPLE: execution: lynching (1-order, 2-command,
3-obedience, 4-society, 5-lawlessness, 6-savage)         1 2 3 4 5 6


Work the problems below:

1. marrow: bone (1-fist, 2-gist, 3-boxing, 4-argument,
5-millstone, 6-neck)                                     1 2 3 4 5 6

2. infringement: copyright (1-sin, 2-trespass, 3-wrong,
4-faith, 5-property, 6-statutes)                         1 2 3 4 5 6


TEST IV—QUANTITATIVE REASONING


This is a test of ability to think logically and to arrive at conclusions 
through processes of induction and deduction from quantitative data. 


Part I: Discovering Principles. 

The student should note that in this part he must write out his an- 
swers rather than select them from a group of possibilities. 

You are to examine imaginary series of observations. You will then be
asked to draw certain conclusions from these observations and to
discover relationships and laws.

In principle, these problems are similar to those which confronted our
earliest scientists. Since these observations are imaginary, however,
the relationships and laws which can be discovered from them are not
those with which scientists are actually familiar. Therefore, success
on this test does not depend upon knowledge of sciences such as
physics and chemistry, but rather on your ability to study
observations and draw logical conclusions from them.


To fix these ideas in your mind, study the following easy example: 


Observations
   A       B
   6      12
   9      18
  27      54
   7.5    15
   8      16


What is the value of B when A equals 13? Answer: 26.


EXPLANATION: It is quickly seen that each number in column B
is exactly double the corresponding number in column A. Therefore,
when A is 13, B must equal 26. Note that the simplest way of stating
this relationship is in the form of an equation. When thus stated, it
becomes:

                    B = 2A    or    A = B/2


Work the problems below:
   A       B        1. When A = 72, what is the value of B?
   8       2
  32       4        2. When B = 0.7, what is the value of A?
  18       3
 200      10        3. State the formula in algebraic terms.
  50       5


(Though not stated in the Practice Booklet, this formula is: A = 2B².)
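
The formula can be checked against the tabled observations with a short sketch. It assumes the observation pairs are (8, 2), (32, 4), (18, 3), (200, 10) and (50, 5), and that the first question concerns A = 72; the function names are illustrative only.

```python
import math

# Observation pairs (A, B), as assumed from the practice table.
observations = [(8, 2), (32, 4), (18, 3), (200, 10), (50, 5)]


def a_from_b(b):
    """Hypothesized law: A = 2 * B**2."""
    return 2 * b ** 2


def b_from_a(a):
    """Inverse of the hypothesized law, taking the positive root."""
    return math.sqrt(a / 2)


# The law should reproduce every tabled value of A from its B.
fits = all(math.isclose(a, a_from_b(b)) for a, b in observations)

answer_1 = b_from_a(72)    # question 1: B when A = 72
answer_2 = a_from_b(0.7)   # question 2: A when B = 0.7

print(fits, answer_1, round(answer_2, 2))
```

The same check could be used to discard a rival guess such as A = 4B, which fits the first pair but none of the others.
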


Part II: Number Series. 


The numbers in each problem are arranged according to some
particular scheme. The following example shows you how to proceed.
You are to indicate the numbers which come next in the series.


    7   14   28   56   112   224   448

Work the problems below:

1.  1    3    5    7    9   ___  ___
2.  2    4    8   16   32   ___  ___
3.  1    4    9   16   25   ___  ___

Part III: Relationships. 


Study the symbols below. They are commonly used to stand for cer- 
tain relationships. 


= means “‘is equal to" 
< means “‘is less than” 


> means “is greater than” 


Using the information under GIVEN FACTS, find the symbol which
expresses the most exact relationship between the two letters under
CONCLUSION. Then cross out the number corresponding to that
symbol. Work the problems below. The first one is done correctly.


GIVEN FACTS                  CONCLUSION
                                  =   <   >
1. A = B; C = B      therefore A  1   2   3  C
2. A = B; C < B      therefore A  1   2   3  C
3. A > B; C = B      therefore A  1   2   3  C


TEST V—MATHEMATICAL INGENUITY


This is a test to measure aptitude in mathematics. 


Part I: Solve for x.                                        Answers

[Two sample equations, each with five answer choices; the printed
expressions are not legible in this copy.]

Part II 


In the problems below, you are given a sentence which expresses a cer- 
tain relationship, with five equations written after it. One of these 
equations correctly expresses the relationship stated in the sentence. 
Find this equation and mark its identifying key number. 


                                                            Answers
1. The area (A) of a rectangle is equal to the product of
the two sides (s1 and s2).

(1) [illegible]    (2) A = s1 + s2    (3) A = s1s2
(4) [illegible]    (5) [illegible]                         1 2 3 4 5

2. The energy (E) of a moving body is equal to one-half
the mass (m) times the square of the velocity (v).

(1) [illegible]    (2) E = ½mv        (3) E = ½mv²
(4) E = mv²        (5) [illegible]                         1 2 3 4 5


Part III: (Note: Figures in this section are not necessarily drawn to
scale.)

[Figure: three circles having equal radii]

Given: AB = AC = AD                                         Answers
Radius OX = 4/3
The three circles have equal radii.
Hence: AB = ?                                              1 2 3 4 5
       (1)   (1.15)   (1.73+)   (1.75)   (2)


Part III of the secondary school Mathematical Ingenuity Test repre- 
sents a learning problem which cannot well be illustrated in advance. 


TEST VI—SPATIAL RELATIONS 


Part I: Each pile of blocks below has been made by gluing together
CUBES of the same size. After being glued together, each pile was
painted on all sides except the bottom. You are to examine each figure
and determine HOW MANY CUBES have, respectively: none of their
sides painted; just one of their sides painted; just two of their sides
painted; just three of their sides painted; just four of their sides
painted; just five of their sides painted.

EXAMPLE:


[Figure 0]                    [Figure 1]


In Fig. 0, how many CUBES have:                      Answer

none of their sides painted?                            0
just one of their sides painted?                        0
just two of their sides painted?                        1
just three of their sides painted?                      0
just four of their sides painted?                       2
just five of their sides painted?                       1


EXPLANATION: There are four cubes in Figure 0. Cube A has just
five painted sides. Similarly, Cube B and Cube C have just four
painted sides each. Cube D, hidden from sight by the others
(underneath Cube A), has only two painted sides. There are, then, no cubes
with none of their sides painted, nor are there any with just one or just
three painted sides. There are, however, one cube with two painted
sides, two with four, and one with five, and the answers would be as
shown.


PROBLEM: In Fig. 1, how many CUBES have:             Answer

none of their sides painted?                           ___
just one of their sides painted?                       ___
just two of their sides painted?                       ___
just three of their sides painted?                     ___
just four of their sides painted?                      ___
just five of their sides painted?                      ___


Forecasting College Achievement 


264 


[Figures: orthographic drawings of solid objects, with panels labeled
TOP VIEW, FRONT VIEW and END VIEW]

... views of various solid objects. That is, the projection of the
object looking down on it is shown in the upper left-hand corner; the
projection looking at it from the front is shown in the lower left-hand
corner; and the projection looking at it from the end is shown in the
lower right-hand corner. These views are always in the same positions.
Figure A is an orthographic view of a simple object. The conventional
perspective view is shown in Figure B. ...


TEST VII—MECHANICAL INGENUITY


DIRECTIONS: Read the following description of the diagram and then
answer the questions.


DESCRIPTION: Pulleys A and B are joined firmly so that they turn
together. XY is an elastic belt. C is the driven pulley.

                                                            Answers
1. In order to make C revolve in the direction opposite from that
which is shown, would you
   (1) put the belt around A?
   (2) reverse one end of belt?
   (3) leave it as it is?                                    1 2 3

2. If B is larger than C, which turns faster?
   (1) B
   (2) C
   (3) Both turn at the same rate                            1 2 3

3. If belt XY were replaced by a belt connecting
A and C, will C move
   (1) faster?
   (2) slower?
   (3) at the same rate?                                     1 2 3


APPENDIX B 


Data as to Means and Standard Deviations, relative to the
correlation tables (9, 10 and 11) in Chapter V, are here presented. These
in turn are followed by additional data, Correlations Involving The
Six Primaries and The General Factor, relative to discussion in
Chapter VI.


TABLE I 


Recapitulation of Table 9, p. 161, with Certain Additional 
Coefficients of Correlation between Aptitude Test Scores and 
First Term Grades, for the Yale Classes of 1944 and 1945 
                                              Aptitude Tests
First Term Course  Class  No. of      I    II   III   IV    V    VI   VII
                          Students   SAT  ALT  VRT  QRT  MAT  SVT  MIT

General Aver- 1944 829 .38 .33 37 81 99 540 24% 


age 1945 969 49 .39 89 .29 28  .19 19 
English 10 1944 668 .30 .10 21 (08 .03 .08 .00 

1945 832 80 20 91 ло (06 .09 .08 
History 10 1944 201 .33 91 97 19 90  .09 208 


1945 229 .33 17 97 18 18 .06 .0 
All English &
History Grades 1944 990 49 .34 40 99 16 .22 17 
Averaged 1945 986 44 .95 40 .22 18 11-01 


German 10 1944 96 91 .38 97 098 96 16 «14 
1945 196 .39 41 33 .30 96 17 46 


Spanish 10 1944 62 46 57 .40 49 41 —.07 44 
1945 145 15 44 17 15 14 9 —.07 


Physics 10 19044 55 40 (57 52 .96 45 40 24 
1945 6% 31 43 45 40 .1 .85 -40 
Mathematics 19 1944 291 оз 91 10 98 .32 11 07 
1945 394 оз 38 85 .39 49 .26 .30 
Engineering 1944 202 11 07 өз 41 15 55 42 
Drawing10 1945 208 19 99 24 30 97 56 47 
All Mathematics 


& Drawing 194 246 16 әз 94 49 42 46 37 
Gender Aver- 1945 986 25 32 'so 43 46 42 44 
age


TABLE II

Summary of Means and Standard Deviations Relative to
Table I

                                     Grades*        Aptitude Tests
First Term Course  Class  No. of     Mean   SD    Range of   Range of Standard
                          Students                Means**    Deviations


General Average 1944 829 2.2 .7 55.0-558 8.7- 9.0 
1945 969 22  .7 55.0-557 8.9- 9.4 


English 10 1944 668 2.0 .7 584-548 80-89 
1945 882 2.0 7 54.0-55.2 8.1- 8.9 

9 

9 


50.8-59.6 7.9- 8.4 


History 10 1944 901 21 
514-546 7.9- 93 


1945 299 22 


All English &
History Grades 1944 290 22  .7 51.9-55.4. 8.4- 9.4 


Averaged 1945 986 22 7 516-551 89-98 
German 10 1944 96 94 10 560-578  84- 9.7 
19045 196 92 10 574-587 88-94 
Spanish 10 1944 62 91 19 59.7-56.8 8.2-101 
1945 145 19 14 537-550 8.2- 9.6 
Physics 10 1944 55 90 10 547—578 7.6- 9.9 


1945 62 18 19 56.0-57.3  8.4-10.5 


Mathematics 12 1944 291 9.0 1.0 542-592 6.1- 8.5 
1045 394 90 12 546-598 7Л- 9.2 


Engineering 1944 202 21 11 539-601 70-84 
rawing 10 1945 293 98 10 588-582 80-929 
All Mathematics


59.9-59.7 7.2- 8.2 


rawing 1044 946 91 
58.8-58.7 8.0- 9.1 


тадез Averaged 1945 986 2.2 


оо 


* Variations in the Means and Standard Deviations of course grades reflect
differences both in composition of groups electing these courses, and in
departmental grading standards.
** Range of Means for aptitude scores reflects differences in performance on tests
comprising the Yale Battery, among students electing the several courses designated.
These data represent all students electing and completing one term of each course.
Hence the Range of Means throughout is less here than for entrants classified
according to pre-matriculation choice, as illustrated by group profiles of prospective
academic and engineering candidates (p. 144, Chapter V). Yet a substantial range in
test-score Means is evident among all students electing Mathematics and Engineering
Drawing, many of whom are not in the pre-engineering group.


TABLE III


Means and Standard Deviations of the Several Navy V-12
Groups Represented in Table 10, p. 165


Date of     No. of      First Term         Navy Achievement Tests
Entrance    Students    Faculty Grades     (Normalized %ile Scores)*
                          M      SD           M      SD

July 333 79.8 7.6 15.6 3.5 

November 148 72.2 7.6 14.9 3.6 

March 237 79.7 7.5 14.0 3.9 
English English 

July 898 73.1 6.2 15.1 3.6 

November 148 747 5.8 14.3 3.9 

March 918 79.7 7.4 18.6 4.0 
History History 

July 897 74.1 8.4 15.2 3.7 

November 146 76.3 8.6 15.0 3.9 

March 917 16.8 8.5 13.8 9.9 
Physics Physics 

July 333 69.2 12.3 15.0 3.5 

November 151 65.9 12.4 14.7 3.4 

March 235 74.0 9.8 14.2 3.8 

Mathematics Mathematics 

July 332 71.5 11.7 15.6 8.5 

November 146 ү! 11.9 14.6 3.5 

March 233 70.4 191 13.9 4.0 


*The normalized scale to which achievement test percentile scores were
transmuted is a twenty-one interval scale ranging from 3 to 23 inclusive. The median
interval is 13. See p. 167, Chapter V for explanation of how the Navy Percentile
Scores were "normalized" and converted into standard deviation equivalents.
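The footnote gives only the end points (3 and 23) and the median interval (13) of the normalized scale; the conversion procedure itself is described on p. 167 and is not reproduced here. The following is therefore only a minimal sketch of how a percentile might be carried to such a scale through standard deviation equivalents. The interval width of one quarter of a standard deviation (`STEP_SD`) and the function name `percentile_to_scale` are assumptions made for illustration, not values taken from the text.

```python
from statistics import NormalDist

# End points and median interval as stated in the footnote.
SCALE_MIN, SCALE_MEDIAN, SCALE_MAX = 3, 13, 23

# ASSUMPTION: each interval spans a quarter of a standard deviation.
# The source does not state the interval width.
STEP_SD = 0.25


def percentile_to_scale(percentile: float, step_sd: float = STEP_SD) -> int:
    """Map a percentile (0 < percentile < 100) to the 3-23 interval scale."""
    if not 0.0 < percentile < 100.0:
        raise ValueError("percentile must lie strictly between 0 and 100")
    # Standard deviation equivalent (z-score) of the percentile
    # under a unit normal curve.
    z = NormalDist().inv_cdf(percentile / 100.0)
    # Count whole intervals away from the median interval, then clip
    # to the end points of the scale.
    interval = SCALE_MEDIAN + round(z / step_sd)
    return max(SCALE_MIN, min(SCALE_MAX, interval))


if __name__ == "__main__":
    for p in (2, 16, 50, 84, 98):
        print(p, percentile_to_scale(p))
```

Under these assumptions the 50th percentile falls in the median interval, 13, and extreme percentiles are held at the end points 3 and 23. A different interval width would stretch or compress the intermediate values without moving the median.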
Means and Standard Deviations of the Several Navy V-12
Groups Represented in Table 11, p. 168


Date of      Achievement Test      No. of      Aptitude Tests
Entrance     (normalized %ile      Students    Range of       Range of Standard
             basis)*                           Means**        Deviations
             M        SD

             Total Score
July         15.6     3.5          335         46.4-51.4      7.1-10.9
November     15.0     3.6          155         44.9-53.5      8.3-10.5
March        14.0     8.9          935         49.7-58.1      9.3-10.3

             English
July         15.0     3.6          335         46.4-51.4      7.1-10.9
November     14.3     3.9          155         44.9-53.5      8.3-10.5
March        13.6     4.1          235         42.7-53.1      9.3-10.8

             History
July         15.2     3.6          339         46.1-51.1      7.4-11.0
November     15.0     3.9          149         44.0-53.5      8.3-10.3
March        13.2     4.5          235         42.7-53.1      9.3-10.3

             Physics
July         15.0     3.5          335         46.4-51.4      7.1-10.9
November     14.6     3.4          155         44.9-53.5      8.3-10.5
March        14.9     3.8          235         42.7-53.1      9.8-10.3

             Mathematics
July         15.9     3.2          310         47.1-51.8      7.1-10.2
November     15.0     3.3          132         44.8-53.9      8.3-9.9
March        14.0     4.1          235         42.7-53.1      9.3-10.3


*The normalized scale to which achievement test percentile scores were
transmuted is a twenty-one interval scale ranging from 3 to 23 inclusive. The median
interval is 13.

** See first footnote to Table 10, p. 165, with reference to difference in means of
the July 1943 as compared with later groups, on the Mechanical Ingenuity Test.


